Communicated by Lawrence Jackel
VIEW
Neural Networks and the Bias/Variance Dilemma

Stuart Geman
Division of Applied Mathematics, Brown University, Providence, RI 02912 USA
Elie Bienenstock
René Doursat
ESPCI, 10 rue Vauquelin, 75005 Paris, France
Feedforward neural networks trained by error backpropagation are examples of nonparametric regression estimators. We present a tutorial on nonparametric inference and its relation to neural networks, and we use the statistical viewpoint to highlight strengths and weaknesses of neural models. We illustrate the main points with some recognition experiments involving artificial data as well as handwritten numerals. By way of conclusion, we suggest that current-generation feedforward neural networks are largely inadequate for difficult problems in machine perception and machine learning, regardless of parallel-versus-serial hardware or other implementation issues. Furthermore, we suggest that the fundamental challenges in neural modeling are about representation rather than learning per se. This last point is supported by additional experiments with handwritten numerals.

1 Introduction
Much of the recent work on feedforward artificial neural networks brings to mind research in nonparametric statistical inference. This is a branch of statistics concerned with model-free estimation, or, from the biological viewpoint, tabula rasa learning. A typical nonparametric inference problem is the learning (or "estimating," in statistical jargon) of arbitrary decision boundaries for a classification task, based on a collection of labeled (pre-classified) training samples. The boundaries are arbitrary in the sense that no particular structure, or class of boundaries, is assumed a priori. In particular, there is no parametric model, as there would be with a presumption of, say, linear or quadratic decision surfaces. A similar point of view is implicit in many recent neural network formulations, suggesting a close analogy to nonparametric inference. Of course statisticians who work on nonparametric inference rarely concern themselves with the plausibility of their inference algorithms
as brain models, much less with the prospects for implementation in "neural-like" parallel hardware, but nevertheless certain generic issues are unavoidable and therefore of common interest to both communities. What sorts of tasks, for instance, can be learned, given unlimited time and training data? Also, can we identify "speed limits," that is, bounds on how fast, in terms of the number of training samples used, something can be learned?

Nonparametric inference has matured in the past 10 years. There have been new theoretical and practical developments, and there is now a large literature from which some themes emerge that bear on neural modeling. In Section 2 we will show that learning, as it is represented in some current neural networks, can be formulated as a (nonlinear) regression problem, thereby making the connection to the statistical framework.

Concerning nonparametric inference, we will draw some general conclusions and briefly discuss some examples to illustrate the evident utility of nonparametric methods in practical problems. But mainly we will focus on the limitations of these methods, at least as they apply to nontrivial problems in pattern recognition, speech recognition, and other areas of machine perception. These limitations are well known, and well understood in terms of what we will call the bias/variance dilemma. The essence of the dilemma lies in the fact that estimation error can be decomposed into two components, known as bias and variance; whereas incorrect models lead to high bias, truly model-free inference suffers from high variance. Thus, model-free (tabula rasa) approaches to complex inference tasks are slow to "converge," in the sense that large training samples are required to achieve acceptable performance. This is the effect of high variance, and is a consequence of the large number of parameters (indeed, an infinite number in truly model-free inference) that need to be estimated. Prohibitively large training sets are then required to reduce the variance contribution to estimation error. Parallel architectures and fast hardware do not help here: this "convergence problem" has to do with training set size rather than implementation. The only way to control the variance in complex inference problems is to use model-based estimation. However, and this is the other face of the dilemma, model-based inference is bias-prone: proper models are hard to identify for these more complex (and interesting) inference problems, and any model-based scheme is likely to be incorrect for the task at hand, that is, highly biased.

The issues of bias and variance will be laid out in Section 3, and the "dilemma" will be illustrated by experiments with artificial data as well as on a task of handwritten numeral recognition. Efforts by statisticians to control the tradeoff between bias and variance will be reviewed in Section 4. Also in Section 4, we will briefly discuss the technical issue of consistency, which has to do with the asymptotic (infinite-training-sample) correctness of an inference algorithm. This is of some recent interest in the neural network literature. In Section 5, we will discuss further the bias/variance dilemma, and
relate it to the more familiar notions of interpolation and extrapolation. We will then argue that the dilemma and the limitations it implies are relevant to the performance of neural network models, especially as concerns difficult machine learning tasks. Such tasks, due to the high dimension of the "input space," are problems of extrapolation rather than interpolation, and nonparametric schemes yield essentially unpredictable results when asked to extrapolate. We shall argue that consistency does not mitigate the dilemma, as it concerns asymptotic as opposed to finite-sample performance. These discussions will lead us to conclude, in Section 6, that learning complex tasks is essentially impossible without the a priori introduction of carefully designed biases into the machine's architecture. Furthermore, we will argue that, despite a long-standing preoccupation with learning per se, the identification and exploitation of the "right" biases are the more fundamental and difficult research issues in neural modeling. We will suggest that some of these important biases can be achieved through proper data representations, and we will illustrate this point by some further experiments with handwritten numeral recognition.
2 Neural Models and Nonparametric Inference
2.1 Least-Squares Learning and Regression. A typical learning problem might involve a feature or input vector x, a response vector y, and the goal of learning to predict y from x, where the pair (x, y) obeys some unknown joint probability distribution, P. A training set (x_1, y_1), ..., (x_N, y_N) is a collection of observed (x, y) pairs containing the desired response y for each input x. Usually these samples are independently drawn from P, though many variations are possible. In a simple binary classification problem, y is actually a scalar y ∈ {0, 1}, which may, for example, represent the parity of a binary input string x ∈ {0, 1}^l, or, as a second example, the voiced/unvoiced classification of a phoneme suitably coded by x. The former is "degenerate" in the sense that y is uniquely determined by x, whereas the classification of a phoneme might be ambiguous. For clearer exposition, we will take y to be one-dimensional, although our remarks apply more generally. The learning problem is to construct a function (or "machine") f(x) based on the data (x_1, y_1), ..., (x_N, y_N), so that f(x) approximates the desired response y. Typically, f is chosen to minimize some cost functional. For example, in feedforward networks (Rumelhart et al. 1986a,b), one usually forms the sum of observed squared errors,

$$\sum_{i=1}^{N} \left[ y_i - f(x_i) \right]^2 \tag{2.1}$$
and f is chosen to make this sum as small as possible. Of course f is really parameterized, usually by idealized "synaptic weights," and the minimization of equation 2.1 is not over all possible functions f, but over the class generated by all allowed values of these parameters. Such minimizations are much studied in statistics, since, as we shall later see, they are one way to estimate a regression. The regression of y on x is E[y | x], that is, that (deterministic) function of x that gives the mean value of y conditioned on x. In the degenerate case, that is, if the probability distribution P allows only one value of y for each x (as in the parity problem, for instance), E[y | x] is not really an average: it is just the allowed value. Yet the situation is often ambiguous, as in the phoneme classification problem. Consider the classification example with just two classes: "Class A" and its complement. Let y be 1 if a sample x is in Class A, and 0 otherwise. The regression is then simply
$$E[y \mid x] = P(y = 1 \mid x) = P(\text{Class A} \mid x)$$
the probability of being in Class A as a function of the feature vector x. It may or may not be the case that x unambiguously determines class membership, y. If it does, then for each x, E[y | x] is either 0 or 1: the regression is a binary-valued function. Binary classification will be illustrated numerically in Section 3, in a degenerate as well as in an ambiguous case. More generally, we are out to "fit the data," or, more accurately, fit the ensemble from which the data were drawn. The regression is an excellent solution, by the following reasoning. For any function f(x), and any fixed x,¹
$$
\begin{aligned}
E\left[(y - f(x))^2 \mid x\right]
&= E\left[\left((y - E[y \mid x]) + (E[y \mid x] - f(x))\right)^2 \mid x\right] \\
&= E\left[(y - E[y \mid x])^2 \mid x\right] + (E[y \mid x] - f(x))^2 \\
&\qquad + 2\, E\left[(y - E[y \mid x]) \mid x\right] \cdot (E[y \mid x] - f(x)) \\
&= E\left[(y - E[y \mid x])^2 \mid x\right] + (E[y \mid x] - f(x))^2 \\
&\qquad + 2\, (E[y \mid x] - E[y \mid x]) \cdot (E[y \mid x] - f(x)) \\
&= E\left[(y - E[y \mid x])^2 \mid x\right] + (E[y \mid x] - f(x))^2 \\
&\geq E\left[(y - E[y \mid x])^2 \mid x\right] \qquad (2.2)
\end{aligned}
$$

In other words, among all functions of x, the regression is the best predictor of y given x, in the mean-squared-error sense.

¹For any function φ(x, y), and any fixed x, E[φ(x, y) | x] is the conditional expectation of φ(x, y) given x, that is, the average of φ(x, y) taken with respect to the conditional probability distribution P(y | x).

Similar remarks apply to likelihood-based (instead of least-squares-based) approaches, such as the Boltzmann Machine (Ackley et al. 1985; Hinton and Sejnowski 1986). Instead of decreasing squared error, the
Neural Networks and the Bias/Variance Dilemma
5
Boltzmann Machine implements a Monte Carlo computational algorithm for increasing likelihood. This leads to the maximum-likelihood estimator of a probability distribution, at least if we disregard local maxima and other confounding computational issues. The maximum-likelihood estimator of a distribution is certainly well studied in statistics, primarily because of its many optimality properties. Of course, there are many other examples of neural networks that realize well-defined statistical estimators (see Section 5.1). The most extensively studied neural network in recent years is probably the backpropagation network, that is, a multilayer feedforward network with the associated error-backpropagation algorithm for minimizing the observed sum of squared errors (Rumelhart et al. 1986a,b). With this in mind, we will focus our discussion by addressing least-squares estimators almost exclusively. But the issues that we will raise are ubiquitous in the theory of estimation, and our main conclusions apply to a broader class of neural networks.
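To make the optimality of the regression concrete, the following sketch (our own, not the authors'; the toy distribution and all names are assumptions for illustration) checks numerically that E[y | x] beats an arbitrary competing predictor in mean-squared error, as equation 2.2 asserts:

```python
# A minimal numerical check of equation 2.2: among all predictors f(x),
# the regression E[y | x] achieves the smallest mean-squared error.
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution: x uniform on [0, 1], y = sin(2*pi*x) + noise,
# so the regression is E[y | x] = sin(2*pi*x).
x = rng.uniform(0.0, 1.0, size=200_000)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)

regression = lambda x: np.sin(2 * np.pi * x)   # the true E[y | x]
competitor = lambda x: 2.5 * x - 1.0           # an arbitrary alternative f

mse_regression = np.mean((y - regression(x)) ** 2)
mse_competitor = np.mean((y - competitor(x)) ** 2)

print(f"MSE of E[y|x]     : {mse_regression:.4f}")  # ~0.09, the noise variance
print(f"MSE of competitor : {mse_competitor:.4f}")  # strictly larger
```

The first printed value approaches the irreducible error of equation 2.2 (here the noise variance 0.3² = 0.09); any other choice of f can only add to it.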
2.2 Nonparametric Estimation and Consistency. If the response variable is binary, y ∈ {0, 1}, and if y = 1 indicates membership in "Class A," then the regression is just P(Class A | x), as we have already observed. A decision rule, such as "choose Class A if P(Class A | x) > 1/2," then generates a partition of the range of x (call this range H) into H_A = {x : P(Class A | x) > 1/2} and its complement H − H_A = H_A^c. Thus, x ∈ H_A is classified as "A," and x ∈ H_A^c is classified as "not A." It may be the case that H_A and H_A^c are separated by a regular surface (or "decision boundary"), planar or quadratic for example, or the separation may be highly irregular.

Given a sequence of observations (x_1, y_1), (x_2, y_2), ... we can proceed to estimate P(Class A | x) (= E[y | x]), and hence the decision boundary, from two rather different philosophies. On the one hand we can assume a priori that H_A is known up to a finite, and preferably small, number of parameters, as would be the case if H_A and H_A^c were linearly or quadratically separated; or, on the other hand, we can forgo such assumptions and "let the data speak for itself." The chief advantage of the former, parametric, approach is of course efficiency: if the separation really is planar or quadratic, then many fewer data are needed for accurate estimation than if we were to proceed without parametric specifications. But if the true separation departs substantially from the assumed form, then the parametric approach is destined to converge to an incorrect, and hence suboptimal, solution, typically (but depending on details of the estimation algorithm) to a "best" approximation within the allowed class of decision boundaries. The latter, nonparametric, approach makes no such a priori commitments.

The asymptotic (large sample) convergence of an estimator to the object of estimation is called consistency. Most nonparametric regression
algorithms are consistent, for essentially any regression function E[y | x].² This is indeed a reassuring property, but it comes with a high price: depending on the particular algorithm and the particular regression, nonparametric methods can be extremely slow to converge. That is, they may require very large numbers of examples to make relatively crude approximations of the target regression function. Indeed, with small samples the estimator may be too dependent on the particular samples observed, that is, on the particular realizations of (x, y) (we say that the variance of the estimator is high). Thus, for a fixed and finite training set, a parametric estimator may actually outperform a nonparametric estimator, even when the true regression is outside of the parameterized class. These issues of bias and variance will be further discussed in Section 3.

²One has to specify the mode of convergence: the estimator is itself a function, and furthermore depends on the realization of a random training set (see Section 4.2). One also has to require certain technical conditions, such as measurability of the regression function.

For now, the important point is that there exist many consistent nonparametric estimators, for regressions as well as probability distributions. This means that, given enough training samples, optimal decision rules can be arbitrarily well approximated. These estimators are extensively studied in the modern statistics literature. Parzen windows and nearest-neighbor rules (see, e.g., Duda and Hart 1973; Härdle 1990), regularization methods (see, e.g., Wahba 1982) and the closely related method of sieves (Grenander 1981; Geman and Hwang 1982), projection pursuit (Friedman and Stuetzle 1981; Huber 1985), recursive partitioning methods such as "CART," which stands for "Classification and Regression Trees" (Breiman et al. 1984), Alternating Conditional Expectations, or "ACE" (Breiman and Friedman 1985), and Multivariate Adaptive Regression Splines, or "MARS" (Friedman 1991), as well as feedforward neural networks (Rumelhart et al. 1986a,b) and Boltzmann Machines (Ackley et al. 1985; Hinton and Sejnowski 1986), are a few examples of techniques that can be used to construct consistent nonparametric estimators.

2.3 Some Applications of Nonparametric Inference. In this paper, we shall be mostly concerned with limitations of nonparametric methods, and with the relevance of these limitations to neural network models. But there is also much practical promise in these methods, and there have been some important successes. An interesting and difficult problem in industrial "process specification" was recently solved at the General Motors Research Labs (Lorenzen 1988) with the help of the already mentioned CART method (Breiman et al. 1984). The essence of CART is the following. Suppose that there are m classes, y ∈ {1, 2, ..., m}, and an input, or feature, vector x. Based on a training sample (x_1, y_1), ..., (x_N, y_N), the CART algorithm constructs a partitioning of the (usually high-dimensional) domain of x into rectan-
gular cells, and estimates the class-probabilities {P(y = k) : k = 1, ..., m} within each cell. Criteria are defined that promote cells in which the estimated class probabilities are well-peaked around a single class, and at the same time discourage partitions into large numbers of cells, relative to N. CART provides a family of recursive partitioning algorithms for approximately optimizing a combination of these competing criteria.

The GM problem solved by CART concerned the casting of certain engine-block components. A new technology known as lost-foam casting promises to alleviate the high scrap rate associated with conventional casting methods. A Styrofoam "model" of the desired part is made, and then surrounded by packed sand. Molten metal is poured onto the Styrofoam, which vaporizes and escapes through the sand. The metal then solidifies into a replica of the Styrofoam model. Many "process variables" enter into the procedure, involving the settings of various temperatures, pressures, and other parameters, as well as the detailed composition of the various materials, such as sand. Engineers identified 80 such variables that were expected to be of particular importance, and data were collected to study the relationship between these variables and the likelihood of success of the lost-foam casting procedure. (These variables are proprietary.) Straightforward data analysis on a training set of 470 examples revealed no good "first-order" predictors of success of casts (a binary variable) among the 80 process variables. Figure 1 (from Lorenzen 1988) shows a histogram comparison for the variable judged to have the most visually disparate histograms among the 80 variables: the left histogram is from a population of scrapped casts, and the right is from a population of accepted casts. Evidently, this variable has no important prediction power in isolation from other variables. Other data analyses indicated similarly that no obvious low-order multiple relations could reliably predict success versus failure. Nevertheless, the CART procedure identified achievable regions in the space of process variables that reduced the scrap rate in this production facility by over 75%. As might be expected, this success was achieved by a useful mix of the nonparametric algorithm, which in principle is fully automatic, and the statistician's need to bring to bear the realities and limitations of the production process. In this regard, several important modifications were made to the standard CART algorithm. Nevertheless, the result is a striking affirmation of the potential utility of nonparametric methods.

There have been many success stories for nonparametric methods. An intriguing application of CART to medical diagnosis is reported in Goldman et al. (1982), and further examples with CART can be found in Breiman et al. (1984). The recent statistics and neural network literatures contain examples of the application of other nonparametric methods as well. A much-advertised neural network example is the evaluation of loan applications (cf. Collins et al. 1989). The basic problem is to classify a loan candidate as acceptable or not acceptable based on 20 or so
Figure 1: Left histogram: distribution of process variable for unsuccessful castings. Right histogram: distribution of same process variable for successful castings. Among all 80 process variables, this variable was judged to have the most dissimilar success/failure histograms. (Lorenzen 1988)
variables summarizing an applicant's financial status. These include, for example, measures of income and income stability, debt and other financial obligations, credit history, and possibly appraised values in the case of mortgages and other secured loans. A conventional parametric statistical approach is the so-called logit model (see, for example, Cox 1970), which posits a linear relationship between the logistic transformation of the desired variable (here the probability of a successful return to the lender) and the relevant independent variables (defining financial status).³ Of course, a linear model may not be suitable, in which case the logit estimator would perform poorly; it would be too biased. On the other hand, very large training sets are available, and it makes good sense to try less parametric methods, such as the backpropagation algorithm, the nearest-neighbor algorithm, or the "Multiple-Neural-Network Learning System" advocated for this problem by Collins et al. (1989).
³The logistic transformation of a probability p is log_e[p/(1 − p)].
These examples will be further discussed in Section 5, where we shall draw a sharp contrast between these relatively easy tasks and problems arising in perception and in other areas of machine intelligence.

3 Bias and Variance
3.1 The Bias/Variance Decomposition of Mean-Squared Error. The regression problem is to construct a function f(x) based on a "training set" (x_1, y_1), ..., (x_N, y_N), for the purpose of approximating y at future observations of x. This is sometimes called "generalization," a term borrowed from psychology. To be explicit about the dependence of f on the data D = {(x_1, y_1), ..., (x_N, y_N)}, we will write f(x; D) instead of simply f(x). Given D, and given a particular x, a natural measure of the effectiveness of f as a predictor of y is

$$E\left[\left(y - f(x; D)\right)^2 \mid x, D\right]$$
the mean-squared error (where E[·] means expectation with respect to the probability distribution P; see Section 2). In our new notation emphasizing the dependency of f on D (which is fixed for the moment), equation 2.2 reads
$$E\left[(y - f(x; D))^2 \mid x, D\right] = E\left[(y - E[y \mid x])^2 \mid x, D\right] + \left(f(x; D) - E[y \mid x]\right)^2$$
The first term, E[(y − E[y | x])² | x, D], does not depend on the data, D, or on the estimator, f; it is simply the variance of y given x. Hence the squared distance to the regression function,

$$\left(f(x; D) - E[y \mid x]\right)^2$$
measures, in a natural way, the effectiveness of f as a predictor of y. The mean-squared error of f as an estimator of the regression E[y | x] is

$$E_D\left[\left(f(x; D) - E[y \mid x]\right)^2\right] \tag{3.1}$$
where E_D represents expectation with respect to the training set, D, that is, the average over the ensemble of possible D (for fixed sample size N). It may be that for a particular training set, D, f(x; D) is an excellent approximation of E[y | x], hence a near-optimal predictor of y. At the same time, however, it may also be the case that f(x; D) is quite different for other realizations of D, and in general varies substantially with D, or it may be that the average (over all possible D) of f(x; D) is rather far from the regression E[y | x]. These circumstances will contribute large values in 3.1, making f(x; D) an unreliable predictor of y. A useful way to assess
these sources of estimation error is via the bias/variance decomposition, which we derive in a way similar to 2.2. For any x,

$$
\begin{aligned}
E_D&\left[\left(f(x; D) - E[y \mid x]\right)^2\right] \\
&= E_D\left[\left((f(x; D) - E_D[f(x; D)]) + (E_D[f(x; D)] - E[y \mid x])\right)^2\right] \\
&= E_D\left[\left(f(x; D) - E_D[f(x; D)]\right)^2\right] + \left(E_D[f(x; D)] - E[y \mid x]\right)^2 \\
&\qquad + 2\, E_D\left[f(x; D) - E_D[f(x; D)]\right] \cdot \left(E_D[f(x; D)] - E[y \mid x]\right) \\
&= \underbrace{\left(E_D[f(x; D)] - E[y \mid x]\right)^2}_{\text{"bias"}}
 + \underbrace{E_D\left[\left(f(x; D) - E_D[f(x; D)]\right)^2\right]}_{\text{"variance"}}
\end{aligned}
$$

since the cross term vanishes: E_D[f(x; D) − E_D[f(x; D)]] = 0.
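The decomposition can be verified by direct simulation. The sketch below is our own construction (the polynomial estimator and all settings are assumptions for illustration): it draws many independent training sets, fits an estimator on each, and checks that squared bias and variance sum exactly to the mean-squared error at a fixed test point.

```python
# Monte Carlo check of the bias/variance decomposition at a fixed x0.
import numpy as np

rng = np.random.default_rng(1)
g = lambda x: np.sin(2 * np.pi * x)         # true regression E[y | x]
x0, N, trials, degree = 0.3, 30, 2000, 5    # assumed toy settings

fits = np.empty(trials)
for k in range(trials):
    x = rng.uniform(0.0, 1.0, N)
    y = g(x) + rng.normal(0.0, 0.3, N)      # training set D^k
    coeffs = np.polyfit(x, y, degree)       # least-squares polynomial fit
    fits[k] = np.polyval(coeffs, x0)        # f(x0; D^k)

mse      = np.mean((fits - g(x0)) ** 2)     # E_D[(f(x0;D) - E[y|x0])^2]
bias_sq  = (fits.mean() - g(x0)) ** 2       # (E_D[f(x0;D)] - E[y|x0])^2
variance = fits.var()                       # E_D[(f(x0;D) - E_D[f])^2]

print(f"MSE       {mse:.5f}")
print(f"bias^2    {bias_sq:.5f}")
print(f"variance  {variance:.5f}")          # bias^2 + variance == MSE
```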
If, on the average, f(x; D) is different from E[y | x], then f(x; D) is said to be biased as an estimator of E[y | x]. In general, this depends on P; the same f may be biased in some cases and unbiased in others. As said above, an unbiased estimator may still have a large mean-squared error if the variance is large: even with E_D[f(x; D)] = E[y | x], f(x; D) may be highly sensitive to the data, and, typically, far from the regression E[y | x]. Thus either bias or variance can contribute to poor performance.

There is often a tradeoff between the bias and variance contributions to the estimation error, which makes for a kind of "uncertainty principle" (Grenander 1951). Typically, variance is reduced through "smoothing," via a combining, for example, of the influences of samples that are nearby in the input (x) space. This, however, will introduce bias, as details of the regression function will be lost; for example, sharp peaks and valleys will be blurred.

3.2 Examples. The issue of balancing bias and variance is much studied in estimation theory. The tradeoff is already well illustrated in the one-dimensional regression problem: x = x ∈ [0, 1]. In an elementary version of this problem, y is related to x by
$$y = g(x) + \eta \tag{3.2}$$
where g is an unknown function, and η is zero-mean "noise" with distribution independent of x. The regression is then g(x), and this is the best (mean-squared-error) predictor of y. To make our points more clearly, we will suppose, for this example, that only y is random; x can be chosen as we please. If we are to collect N observations, then a natural "design" for the inputs is x_i = i/N, 1 ≤ i ≤ N, and the data are then the corresponding N values of y, D = {y_1, ..., y_N}. An example (from Wahba and Wold 1975), with N = 100, g(x) = 4.26(e^{−x} − 4e^{−2x} + 3e^{−3x}), and η gaussian with standard deviation 0.2, is shown in Figure 2.
Figure 2: One hundred observations (squares) generated according to equation 3.2, with g(x) = 4.26(e^{−x} − 4e^{−2x} + 3e^{−3x}). The noise is zero-mean gaussian with standard error 0.2. In each panel, the broken curve is g and the solid curve is a spline fit. (a) Smoothing parameter chosen to control variance. (b) Smoothing parameter chosen to control bias. (c) A compromising value of the smoothing parameter, chosen automatically by cross-validation. (From Wahba and Wold 1975)
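A reader wishing to reproduce the flavor of Figure 2 can do so in a few lines; the sketch below is ours, not the authors' original code. Note that scipy's UnivariateSpline parameterizes the tradeoff by a bound s on the residual sum of squares (large s forces a smoother fit) rather than by a roughness bound on the spline itself, but small and large s exhibit the same under- and oversmoothing behavior discussed in the text.

```python
# Regenerate Figure-2-style data and fit smoothing splines at three
# smoothing levels (our assumed values of s, for illustration only).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)
N = 100
x = np.arange(1, N + 1) / N                          # design x_i = i/N
g = 4.26 * (np.exp(-x) - 4 * np.exp(-2 * x) + 3 * np.exp(-3 * x))
y = g + rng.normal(0.0, 0.2, N)                      # equation 3.2

oversmoothed  = UnivariateSpline(x, y, k=3, s=50.0)  # roughness all but forbidden
interpolating = UnivariateSpline(x, y, k=3, s=0.0)   # fits every data point
compromise    = UnivariateSpline(x, y, k=3, s=N * 0.2**2)  # matched to noise level

for name, f in [("oversmoothed", oversmoothed),
                ("interpolating", interpolating),
                ("compromise", compromise)]:
    # Mean squared distance to the true regression over the design points.
    print(name, float(np.mean((f(x) - g) ** 2)))
```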
The squares are the data points, and the broken curve, in each panel, is the regression, g(x). (The solid curves are estimates of the regression, as will be explained shortly.) The object is to make a guess at g(x), using the noisy observations, y_i = g(x_i) + η_i, 1 ≤ i ≤ N. At one extreme, f(x; D) could be defined as the linear (or some other) interpolant of the data. This estimator is truly unbiased at x = x_i, 1 ≤ i ≤ N, since

$$E_D\left[f(x_i; D)\right] = E\left[g(x_i) + \eta_i\right] = g(x_i) = E[y \mid x_i]$$
Furthermore, if g is continuous there is also very little bias in the vicinity of the observation points, x_i, 1 ≤ i ≤ N. But if the variance of y is large, then there will be a large variance component to the mean-squared error (3.1), since

$$E_D\left[\left(f(x_i; D) - E_D[f(x_i; D)]\right)^2\right] = E_D\left[\left(g(x_i) + \eta_i - g(x_i)\right)^2\right] = E\left[\eta_i^2\right]$$
which, since η_i has zero mean, is the variance of η_i. This estimator is indeed very sensitive to the data. At the other extreme, we may take f(x; D) = h(x) for some well-chosen function h(x), independent of D. This certainly solves the variance problem! Needless to say, there is likely to be a substantial bias, for this estimator does not pay any attention to the data. A better choice would be an intermediate one, balancing some reasonable prior expectation, such as smoothness, with faithfulness to the observed data. One example is a feedforward neural network trained by error backpropagation. The output of such a network is f(x; w) = f[x; w(D)], where w(D) is a collection of weights determined by (approximately) minimizing the sum of squared errors:
$$\sum_{i=1}^{N} \left[ y_i - f(x_i; w) \right]^2 \tag{3.3}$$

How big a network should we employ? A small network, with say one hidden unit, is likely to be biased, since the repertoire of available functions spanned by f(x; w) over allowable weights will in this case be quite limited. If the true regression is poorly approximated within this class, there will necessarily be a substantial bias. On the other hand, if we overparameterize, via a large number of hidden units and associated weights, then the bias will be reduced (indeed, with enough weights and hidden units, the network will interpolate the data), but there is then the danger of a significant variance contribution to the mean-squared error. (This may actually be mitigated by incomplete convergence of the minimization algorithm, as we shall see in Section 3.5.5.)

Many other solutions have been invented, for this simple regression problem as well as its extensions to multivariate settings (y → y ∈ R^d, x → x ∈ R^l, for some d > 1 and l > 1). Often splines are used, for example. These arise by first restricting f via a "smoothing criterion" such as
$$\int \left| \frac{d^m f}{dx^m}(x) \right|^2 dx \leq \lambda \tag{3.4}$$

for some fixed integer m ≥ 1 and fixed λ. (Partial and mixed partial derivatives enter when x → x ∈ R^l; see, for example, Wahba 1979.) One then solves for the minimum of
$$\sum_{i=1}^{N} \left[ y_i - f(x_i) \right]^2$$

among all f satisfying equation 3.4. This minimization turns out to be tractable and yields f(x) = f(x; D), a concatenation of polynomials of degree 2m − 1 on the intervals (x_i, x_{i+1}); the derivatives of the polynomials, up to order 2m − 2, match at the "knots" {x_i}_{i=1}^N. With m = 1, for example, the solution is continuous and piecewise linear, with discontinuities in
the derivative at the knots. When m = 2 the polynomials are cubic, the first two derivatives are continuous at the knots, and the curve appears globally "smooth." Poggio and Girosi (1990) have shown how splines and related estimators can be computed with multilayer networks. The "regularization" or "smoothing" parameter λ plays a role similar to the number of weights in a feedforward neural network. Small λ produce small-variance, high-bias estimators; the data are essentially ignored in favor of the constraint ("oversmoothing"). Large values of λ produce interpolating splines: f(x_i; D) = y_i, 1 ≤ i ≤ N, which, as we have seen, may be subject to high variance. Examples of both oversmoothing and undersmoothing are shown in Figure 2a and b, respectively. The solid lines are cubic-spline (m = 2) estimators of the regression. There are many recipes for choosing λ, and other smoothing parameters, from the data, a procedure known as "automatic smoothing" (see Section 4.1). A popular example is called cross-validation (again, see Section 4.1), a version of which was used in Figure 2c.

There are of course many other approaches to the regression problem. Two in particular are the nearest-neighbor estimators and the kernel estimators, which we have used in some experiments both on artificial data and on handwritten numeral recognition. The results of these experiments will be reviewed in Section 3.5.

3.3 Nonparametric Estimation. Nonparametric regression estimators are characterized by their being consistent for all regression problems. Consistency requires a somewhat arbitrary specification: in what sense does the estimator f(x; D) converge to the regression E[y | x]? Let us be explicit about the dependence of f on sample size, N, by writing D = D_N and then f(x; D_N) for the estimator, given the N observations D_N. One version of consistency is "pointwise mean-squared error":

$$\lim_{N \to \infty} E_{D_N}\left[\left(f(x; D_N) - E[y \mid x]\right)^2\right] = 0$$

for each x. A more global specification is in terms of integrated mean-squared error:

$$\lim_{N \to \infty} E_{D_N}\left[\int \left(f(x; D_N) - E[y \mid x]\right)^2 dx\right] = 0 \tag{3.5}$$

There are many variations, involving, for example, almost sure convergence, instead of the mean convergence that is defined by the expectation operator E_D. Regardless of the details, any reasonable specification will require that both bias and variance go to zero as the size of the training sample increases. In particular, the class of possible functions f(x; D_N) must approach E[y | x] in some suitable sense,⁴ or there will necessarily
be some residual bias. This class of functions will therefore, in general, have to grow with N. For feedforward neural networks, the possible functions are those spanned by all allowed weight values. For any fixed architecture there will be regressions outside of the class, and hence the network cannot realize a consistent nonparametric algorithm. By the same token, the spline estimator is not consistent (in any of the usual senses) whenever the regression satisfies

$$\int \left| \frac{d^m}{dx^m} E[y \mid x] \right|^2 dx > \lambda$$

since the estimator itself is constrained to violate this condition (see equation 3.4).

⁴The appropriate metric is the one used to define consistency. L₂, for example, with 3.5.

It is by now well-known (see, e.g., White 1990) that a feedforward neural network (with some mild conditions on E[y | x] and network structure, and some optimistic assumptions about minimizing 3.3) can be made consistent by suitably letting the network size grow with the size of the training set, in other words by gradually diminishing bias. Analogously, splines are made consistent by taking λ = λ_N ↑ ∞ sufficiently slowly. This is indeed the general recipe for obtaining consistency in nonparametric estimation: slowly remove bias. This procedure is somewhat delicate, since the variance must also go to zero, which dictates a gradual reduction of bias (see discussion below, Section 5.1). The main mathematical issue concerns this control of the variance, and it is here that tools such as the Vapnik-Chervonenkis dimension come into play. We will be more specific in our brief introduction to the mathematics of consistency below (Section 4.2).

As the examples illustrate, the distinction between parametric and nonparametric methods is somewhat artificial, especially with regards to fixed and finite training sets. Indeed, most nonparametric estimators, such as feedforward neural networks, are in fact a sequence of parametric estimators indexed by sample size.

3.4 The Dilemma. Much of the excitement about artificial neural networks revolves around the promise to avoid the tedious, difficult, and generally expensive process of articulating heuristics and rules for machines that are to perform nontrivial perceptual and cognitive tasks, such as for vision systems and expert systems. We would naturally prefer to "teach" our machines by example, and would hope that a good learning algorithm would "discover" the various heuristics and rules that apply to the task at hand. It would appear, then, that consistency is relevant: a consistent learning algorithm will, in fact, approach optimal performance, whatever the task. Such a system might be said to be unbiased, as it is not a priori dedicated to a particular solution or class of solutions. But the price to pay for achieving low bias is high variance. A machine sufficiently versatile to reasonably approximate a broad range of
input/output mappings is necessarily sensitive to the idiosyncrasies of the particular data used for its training, and therefore requires a very large training set. Simply put, dedicated machines are harder to build but easier to train. Of course there is a quantitative tradeoff, and one can argue that for many problems acceptable performance is achievable from a more or less tabula rasa architecture, and without unrealistic numbers of training examples. Or that specific problems may suggest easy and natural specific structures, which introduce the "right" biases for the problem at hand, and thereby mitigate the issue of sample size. We will discuss these matters further in Section 5.

3.5 Experiments in Nonparametric Estimation. In this section, we shall report on two kinds of experiments, both concerning classification, but some using artificial data and others using handwritten numerals. The experiments with artificial data are illustrative since they involve only two dimensions, making it possible to display estimated regressions as well as bias and variance contributions to mean-squared error. Experiments were performed with nearest-neighbor and Parzen-window estimators, and with feedforward neural networks trained via error backpropagation. Results are reported following brief discussions of each of these estimation methods.
3.5.1 Nearest-Neighbor Regression. This simple and time-honored approach provides a good performance benchmark. The "memory" of the machine is exactly the training set D = {(x_1, y_1), ..., (x_N, y_N)}. For any input vector x, a response vector y is derived from the training set by averaging the responses to those inputs from the training set which happen to lie close to x. Actually, there is here a collection of algorithms indexed by an integer, k, which determines the number of neighbors of x that enter into the average. Thus, the k-nearest-neighbor estimator is just

$$f(x; D) = \frac{1}{k} \sum_{i \in N_k(x)} y_i$$
where N_k(x) is the collection of indices of the k nearest neighbors to x among the input vectors in the training set {x_i}_{i=1}^N. (There is also a k-nearest-neighbor procedure for classification: if y = y ∈ {1, 2, ..., C}, representing C classes, then we assign to x the class y ∈ {1, 2, ..., C} most frequent among the set {y_i}_{i∈N_k(x)}, where y_i is the class of the training input x_i.) If k is "large" (e.g., k is almost N) then the response f(x; D) is a relatively smooth function of x, but has little to do with the actual positions of the x_i's in the training set. In fact, when k = N, f(x; D) is independent of x, and of {x_i}_{i=1}^N; the output is just the average observed output (1/N) Σ_{i=1}^N y_i. When N is large, (1/N) Σ_{i=1}^N y_i is likely to be nearly unchanged
from one training set to another. Evidently, the variance contribution to mean-squared error is then small. On the other hand, the response to a particular x is systematically biased toward the population response, regardless of any evidence for local variation in the neighborhood of x. For most problems, this is of course a bad estimation policy.

The other extreme is the first-nearest-neighbor estimator; we can expect less bias. Indeed, under reasonable conditions, the bias of the first-nearest-neighbor estimator goes to zero as N goes to infinity. On the other hand, the response at each x is rather sensitive to the idiosyncrasies of the particular training examples in D. Thus the variance contribution to mean-squared error is typically large.

From these considerations it is perhaps not surprising that the best solution in many cases is a compromise between the two extremes k = 1 and k = N. By choosing an intermediate k, thereby implementing a reasonable amount of smoothing, one may hope to achieve a significant reduction of the variance, without introducing too much bias. If we now consider the case N → ∞, the k-nearest-neighbor estimator can be made consistent by choosing k = k_N ↑ ∞ sufficiently slowly. The idea is that the variance is controlled (forced to zero) by k_N ↑ ∞, whereas the bias is controlled by ensuring that the k_N-th nearest neighbor of x is actually getting closer to x as N → ∞.

3.5.2 Parzen-Window Regression. The "memory" of the machine is again the entire training set D, but estimation is now done by combining "kernels," or "Parzen windows," placed around each observed input point x_i, 1 ≤ i ≤ N. The form of the kernel is somewhat arbitrary, but it is usually chosen to be a nonnegative function of x that is maximum at x = 0 and decreasing away from x = 0. A common choice is

$$W(x) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\frac{|x|^2}{2} \right\}$$
the gaussian kernel, for x ∈ R^d. The scale of the kernel is adjusted by a "bandwidth" σ: W(x) → (1/σ)^d W(x/σ). The effect is to govern the extent to which the window is concentrated at x = 0 (small σ), or is spread out over a significant region around x = 0 (large σ). Having fixed a kernel W(·), and a bandwidth σ, the Parzen regression estimator at x is formed from a weighted average of the observed responses {y_i}_{i=1}^N:

$$f(x; D) = \frac{\sum_{i=1}^{N} y_i\, W\!\left(\frac{x - x_i}{\sigma}\right)}{\sum_{i=1}^{N} W\!\left(\frac{x - x_i}{\sigma}\right)}$$
Clearly, observations with inputs closer to x are weighted more heavily. There is a close connection between nearest-neighbor and Parzen-window estimation. In fact, when the bandwidth σ is small, only close neighbors of x contribute to the response at this point, and the procedure is akin to k-nearest-neighbor methods with small k.
On the other hand, when σ is large, many neighbors contribute significantly to the response, a situation analogous to the use of large values of k in the k-nearest-neighbor method. In this way, σ governs bias and variance much as k does for the nearest-neighbor procedure: small bandwidths generally offer high-variance/low-bias estimation, whereas large bandwidths incur relatively high bias but low variance.
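Both regression estimators fit in a dozen lines. The sketch below is ours (the one-dimensional toy data and all parameter values are assumptions for illustration); it implements k-nearest-neighbor averaging and the gaussian Parzen-window estimator side by side:

```python
# k-nearest-neighbor and Parzen-window regression on toy 1-d data.
import numpy as np

def knn_regress(x, xs, ys, k):
    """Average the responses of the k training inputs nearest to x."""
    idx = np.argsort(np.abs(xs - x))[:k]        # N_k(x), by 1-d distance
    return ys[idx].mean()

def parzen_regress(x, xs, ys, sigma):
    """Weighted average of all responses, weights W((x - x_i)/sigma)."""
    w = np.exp(-0.5 * ((x - xs) / sigma) ** 2)  # gaussian window
    return np.sum(w * ys) / np.sum(w)

rng = np.random.default_rng(3)
xs = rng.uniform(0.0, 1.0, 200)
ys = np.sin(2 * np.pi * xs) + rng.normal(0.0, 0.3, 200)

x0 = 0.25                                       # true E[y | x0] = 1
print("k = 1      :", knn_regress(x0, xs, ys, 1))    # low bias, high variance
print("k = 25     :", knn_regress(x0, xs, ys, 25))   # smoother, more bias
print("sigma 0.01 :", parzen_regress(x0, xs, ys, 0.01))
print("sigma 0.3  :", parzen_regress(x0, xs, ys, 0.3))
```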
There is also a Parzen-window procedure for classification: we assign to x the class y ∈ {1, 2, ..., C} which maximizes

$$f_y(x; D) = \frac{1}{N_y} \sum_{i:\, y_i = y} \frac{1}{\sigma^d}\, W\!\left(\frac{x - x_i}{\sigma}\right)$$

where N_y is the number of times that the classification y is seen in the training set, N_y = #{i : y_i = y}. If W(x) is normalized, so as to integrate to one, then f_y(x; D) estimates the density of inputs associated with the class y (known as the "class-conditional density"). Choosing the class with maximum density at x results in minimizing the probability of error, at least when the classes are a priori equally likely. (If the a priori probabilities of the C classes, P(y), y ∈ {1, 2, ..., C}, are unequal, but known, then the minimum probability of error is obtained by choosing y to maximize P(y) · f_y(x; D).)

3.5.3 Feedforward Network Trained by Error Backpropagation. Most readers are already familiar with this estimation technique. We used two-layer networks, that is, networks with one hidden layer, with full connections between layers. The number of inputs and outputs depended on the experiment and on a coding convention; it will be laid out with the results of the different experiments in the ensuing paragraphs. In the usual manner, all hidden and output units receive one special input that is nonzero and constant, allowing each unit to learn a "threshold." Each unit outputs a value determined by the sigmoid function

$$o(u) = \tanh(u) \tag{3.8}$$

given the input

$$u = \sum_j w_j \xi_j$$

Here, {ξ_j} represents inputs from the previous layer (together with the above-mentioned constant input) and {w_j} represents the weights ("synaptic strengths") for this unit. Learning is by discrete gradient descent, using the full training set at each step. Thus, if w(t) is the ensemble of all weights after t iterations, then
$$w(t+1) = w(t) - \varepsilon\, \nabla_w E[w(t)] \tag{3.9}$$
where ε is a control parameter, ∇_w is the gradient operator, and E(w) is the mean-squared error over the training samples. Specifically, if f(x; w) denotes the (possibly vector-valued) output of the feedforward network given the weights w, then

$$E(w) = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - f(x_i; w) \right|^2$$
where, as usual, the training set is D = {(x_1, y_1), ..., (x_N, y_N)}. The gradient, ∇_w E(w), is calculated by error backpropagation (see Rumelhart et al. 1986a,b). The particular choices of ε and the initial state w(0), as well as the number of iterations of 3.9, will be specified during the discussion of the experimental results.

It is certainly reasonable to anticipate that the number of hidden units would be an important variable in controlling the bias and variance contributions to mean-squared error. The addition of hidden units contributes complexity and presumably versatility, and the expected price is higher variance. This tradeoff is, in fact, observed in experiments reported by several authors (see, for example, Chauvin 1990; Morgan and Bourlard 1990), as well as in our experiments with artificial data (Section 3.5.4). However, as we shall see in Section 3.5.5, a somewhat more complex picture emerges from our experiments with handwritten numerals (see also Martin and Pittman 1991).

3.5.4 Experiments with Artificial Data. The desired output indicates one of two classes, represented by the values ±0.9. [This coding was used in all three methods, to accommodate the feedforward neural network, in which the sigmoid output function equation 3.8 has asymptotes at ±1. By coding classes with values ±0.9, the training data could be fit by the network without resorting to infinite weights.] In some of the experiments, the classification is unambiguously determined by the input, whereas in others there is some "overlap" between the two classes. In either case, the input has two components, x = (x_1, x_2), and is drawn from the rectangle [−6, 6] × [−1.5, 1.5]. In the unambiguous case, the classification is determined by the curve x_2 = sin((π/2)x_1), which divides the rectangle into "top" [x_2 ≥ sin((π/2)x_1), y = 0.9] and "bottom" [x_2 < sin((π/2)x_1), y = −0.9] pieces. The regression is then the binary-valued function E[y | x] = 0.9 above the sinusoid and −0.9 below (see Fig. 3a). The training set, D = {(x_1, y_1), ..., (x_N, y_N)}, is constructed to have 50 examples from each class. For y = 0.9, the 50 inputs are chosen from the uniform distribution on the region above the sinusoid; the y = −0.9 inputs are chosen uniformly from the region below the sinusoid.

Classification can be made ambiguous within the same basic setup, by randomly perturbing the input vector before determining its class.
Figure 3: Two regression surfaces for experiments with artificial data. (a) Output is a deterministic function of input, +0.9 above sinusoid, and −0.9 below sinusoid. (b) Output is perturbed randomly. Mean value of zero is coded with white, mean value of +0.9 is coded with gray, and mean value of −0.9 is coded with black.
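For concreteness, here is our reading of the sampling scheme as executable code (the mechanism is stated precisely in the next paragraph; function names and the rejection-sampling details are ours):

```python
# Generate the artificial data of Section 3.5.4: deterministic labels from
# the sinusoid x2 = sin((pi/2) x1), or ambiguous labels obtained by first
# perturbing x uniformly within the unit disk B_1(x).
import numpy as np

rng = np.random.default_rng(4)

def label(z):
    """Deterministic rule: +0.9 above the sinusoid, -0.9 below."""
    return 0.9 if z[1] >= np.sin(0.5 * np.pi * z[0]) else -0.9

def ambiguous_label(x):
    """Perturb x uniformly in the unit disk, then classify the perturbation."""
    while True:                                  # rejection-sample the disk
        u = rng.uniform(-1.0, 1.0, size=2)
        if u @ u <= 1.0:
            return label(x + u)

def draw_training_set(n_per_class=50, ambiguous=False):
    data, counts = [], {0.9: 0, -0.9: 0}
    while min(counts.values()) < n_per_class:
        x = np.array([rng.uniform(-6, 6), rng.uniform(-1.5, 1.5)])
        y = ambiguous_label(x) if ambiguous else label(x)
        if counts[y] < n_per_class:              # keep the first 50 of each class
            counts[y] += 1
            data.append((x, y))
    return data

D = draw_training_set(ambiguous=True)            # 100 examples, 50 per class
```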
To describe precisely the random mechanism, let us denote by B_1(x) the disk of unit radius centered at x. For a given x, the classification y is chosen randomly as follows: x is "perturbed" by choosing a point z from the uniform distribution on B_1(x), and y is then assigned value 0.9 if z_2 ≥ sin((π/2)z_1), and −0.9 otherwise. The resulting regression, E[y | x], is depicted in Figure 3b, where white codes the value zero, gray codes the value +0.9, and black codes the value −0.9. Other values are coded by interpolation. (This color code has some ambiguity to it: a given gray level does not uniquely determine a value between −0.9 and 0.9. This code was chosen to emphasize the transition region, where y ≈ 0.) The effect of the classification ambiguity is, of course, most pronounced near the "boundary" x_2 = sin((π/2)x_1).

If the goal is to minimize mean-squared error, then the best response to a given x is E[y | x]. On the other hand, the minimum-error classifier will assign class "+0.9" or "−0.9" to a given x, depending on whether E[y | x] ≥ 0 or not: this is the decision function that minimizes the probability of misclassifying x. The decision boundary of the optimal classifier ({x : E[y | x] = 0}) is very nearly the original sinusoid x_2 = sin((π/2)x_1); it is depicted by the whitest values in Figure 3b.

The training set for the ambiguous classification task was also constructed to have 50 examples from each class. This was done by repeated Monte Carlo choice of pairs (x, y), with x chosen uniformly from the rectangle [−6, 6] × [−1.5, 1.5] and y chosen by the above-described random
mechanism. The first 50 examples for which y = 0.9 and the first 50 examples for which y = −0.9 constituted the training set.

In each experiment, bias, variance, and mean-squared error were evaluated by a simple Monte Carlo procedure, which we now describe. Denote by f(x; D) the regression estimator for any given training set D. Recall that the (squared) bias, at x, is just (E_D[f(x; D)] − E[y | x])² and that the variance is
$$E_D\left[\left(f(x; D) - E_D[f(x; D)]\right)^2\right]$$

These were assessed by choosing, independently, 100 training sets D¹, D², ..., D¹⁰⁰, and by forming the corresponding estimators f(x; D¹), ..., f(x; D¹⁰⁰). Denote by f̄(x) the average response at x: f̄(x) = (1/100) Σ_{k=1}^{100} f(x; D^k). Bias and variance were estimated via the formulas:

$$\text{Bias}(x) = \left(\bar{f}(x) - E[y \mid x]\right)^2$$

$$\text{Variance}(x) = \frac{1}{100} \sum_{k=1}^{100} \left(f(x; D^k) - \bar{f}(x)\right)^2$$

(Recall that E[y | x] is known exactly; see Fig. 3.) The sum, Bias(x) + Variance(x), is the (estimated) mean-squared error, and is equal to

$$\frac{1}{100} \sum_{k=1}^{100} \left(f(x; D^k) - E[y \mid x]\right)^2$$

In several examples we display Bias(x) and Variance(x) via gray-level pictures on the domain [−6, 6] × [−1.5, 1.5]. We also report on integrated bias, variance, and mean-squared error, obtained by simply integrating these functions over the rectangular domain of x (with respect to the uniform distribution).

The experiments with artificial data included both nearest-neighbor estimation and estimation by a feedforward neural network. Results of the experiments with the nearest-neighbor procedure are summarized in Figure 4. In both the deterministic and the ambiguous case, bias increased while variance decreased with the number of neighbors, as should be expected from our earlier discussion. In the deterministic case, the least mean-squared error is achieved using a small number of neighbors, two or three; there is apparently, and perhaps not surprisingly, little need to control the variance. In contrast, the more biased eight- or nine-nearest-neighbor estimator is best for the ambiguous task.

Figures 5 through 7 demonstrate various additional features of the results from experiments with the ambiguous classification problem. Figure 5 shows the actual output, to each possible input, of two machines
[Figure 4 appears here: curves of integrated bias, variance, and total error plotted against the number of neighbors (x-axis "# Neighbors," 1 through 10), with one panel for the deterministic classification task and one for the ambiguous task.]
Figure 4: Integrated bias (∘'s), variance (×'s), and total error (+'s) as functions of the number of neighbors in the nearest-neighbor regression.

trained on a typical sample of the data: Figure 5a is the first-nearest-neighbor solution and Figure 5b is the two-nearest-neighbor solution. The actual training data are also displayed; see the figure legend for interpretation. Average output of the five-nearest-neighbor machine (averaged over 100 independently chosen training sets; see earlier discussion) is depicted in Figure 6 (using the same color convention as for the regression). Compare this with the regression (Fig. 3b): there apparently is very little bias. Finally, in Figure 7, bias and variance are displayed as functions of the input x, both for the first-nearest-neighbor and the 10-nearest-neighbor machines. Notice again the tradeoff.

An analogous pattern emerged from the experiments with the feedforward neural network. In these experiments, the error-backpropagation algorithm (see equation 3.9) was run for 3,000 iterations, using ε = 0.05, and initializing the weights as independent random variables chosen from the uniform distribution on [−0.2, 0.2]. The results are summarized in Figure 8. The relatively unbiased, high-variance, 15-hidden-unit machine is best for the simpler deterministic classification task. For the ambiguous task, the more biased, single-hidden-unit machine is favored. Figure 9 shows
Figure 5: Nearest-neighbor estimates of regression surface shown in Figure 3b. Gray-level code is the same as in Figure 3. The training set comprised 50 examples with values +0.9 (circles with white centers) and 50 examples with values -0.9 (circles with black centers). (a) First-nearest-neighbor estimator. (b) Two-nearest-neighbor estimator.
Figure 6: Average output of 100 five-nearest-neighbor machines, trained on independent data sets. Compare with the regression surface shown in Figure 3b (gray-level code is the same); there is little bias.

the output of two typical machines, with five hidden units each. Both were trained for the ambiguous classification task, but using statistically independent training sets. The contribution of each hidden unit is partially revealed by plotting the line w_1 x_1 + w_2 x_2 + w_3 = 0, where x = (x_1, x_2) is the input vector, w_1 and w_2 are the associated weights, and w_3 is the threshold. On either side of this line, the unit's output is a function solely of distance to the line. The differences between these two machines hint at the variance contribution to mean-squared error (roughly 0.13; see Fig. 8). For the same task and number of hidden
Figure 7: Bias and variance of first-nearest-neighbor and 10-nearest-neighbor estimators, as functions of input vector, for regression surface depicted in Figure 3b. Scale is by gray levels, running from largest values, coded in black, to zero, coded in white. (a) Bias of first-nearest-neighbor estimator. (b) Variance of first-nearest-neighbor estimator. (c) Bias of 10-nearest-neighbor estimator. (d) Variance of 10-nearest-neighbor estimator. Overall, the effect of additional neighbors is to increase bias and decrease variance.
units, the bias contribution to error is relatively small (0.05, again from Fig. 8). This is clear from Figure 10, which shows the average output of the five-hidden-unit machine for the ambiguous classification task. The fit to the regression (Fig. 3b) is good, except for some systematic bias at the left and right extremes, and at the peaks and valleys of the sinusoid. Finally, with reference again to the ambiguous classification task,
[Figure 8 appears here: two panels, "Deterministic Classification" and "Ambiguous Classification," plotting integrated bias, variance, and total error against the number of hidden units (x-axis "# Hidden Units," up to 16).]
Figure 8: Integrated bias (∘'s), variance (×'s), and total error (+'s) as functions of the number of hidden units in a feedforward neural network.

Figure 11 shows bias and variance contributions to error for the one-hidden-unit and the 15-hidden-unit machines. The pattern is similar to Figure 7 (nearest-neighbor machines), and reflects the tradeoff already apparent in Figure 8.
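As a summary of Sections 3.5.3 and 3.5.4, the sketch below shows how such a machine can be trained from scratch. It is our own reconstruction, not the authors' code: we take the sigmoid of equation 3.8 to be tanh, realize thresholds through the constant extra input, and apply full-batch gradient descent (equation 3.9) with ε = 0.05 from weights drawn uniformly on [−0.2, 0.2], as in the experiments above.

```python
# One-hidden-layer network trained by batch gradient descent (backprop).
import numpy as np

rng = np.random.default_rng(5)

def train(X, y, hidden=5, eps=0.05, iters=3000):
    N = X.shape[0]
    Xb = np.hstack([X, np.ones((N, 1))])          # constant input for thresholds
    W1 = rng.uniform(-0.2, 0.2, (hidden, Xb.shape[1]))
    w2 = rng.uniform(-0.2, 0.2, hidden + 1)
    for _ in range(iters):
        A  = np.tanh(Xb @ W1.T)                   # hidden activities
        Ab = np.hstack([A, np.ones((N, 1))])
        o  = np.tanh(Ab @ w2)                     # network output f(x; w)
        d2 = (2.0 / N) * (o - y) * (1.0 - o**2)   # dE/dz at the output unit
        d1 = (d2[:, None] * w2[:hidden]) * (1.0 - A**2)  # backpropagated error
        w2 -= eps * (Ab.T @ d2)                   # equation 3.9
        W1 -= eps * (d1.T @ Xb)
    return W1, w2

# Assumed usage on the deterministic sinusoid task of Section 3.5.4:
X = np.column_stack([rng.uniform(-6, 6, 100), rng.uniform(-1.5, 1.5, 100)])
y = np.where(X[:, 1] >= np.sin(0.5 * np.pi * X[:, 0]), 0.9, -0.9)
W1, w2 = train(X, y)
```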
3.5.5 Experiments with Handwritten Numerals. The data base in these experiments consisted of 1200 isolated handwritten numerals, collected from 12 individuals by I. Guyon at the AT&T Bell Laboratories (Guyon 1988). Each individual provided 10 samples of each of the 10 digits, 0, 1, ..., 9. Each sample was digitized and normalized to fit exactly within a 16 × 16 pixel array; it was then thresholded to produce a binary picture. A sampling of characters from this data base is displayed in the top four rows of Figure 12.

The problem of recognizing characters in this set is rather easy, at least when compared to other widely available data sets, involving for example postal zip codes (see Le Cun et al. 1989) or courtesy numbers from checks. In fact, the data were collected with the intention of producing a more or less canonical set of examples: a standard "model" was chosen for each digit and the 12 individuals were asked to follow the model. However, our interest here was to demonstrate generic features of nonparametric estimation, and this turned out to be more easily done
Figure 9: Output of feedforward neural networks trained on two independent samples of size 100. Actual regression is depicted in Figure 3b, with the same gray-level code. The training set comprised 50 examples with values +0.9 (circles with white centers) and 50 examples with values −0.9 (circles with black centers). Straight lines indicate points of zero output for each of the five hidden units; outputs are functions of distance to these lines. Note the large variation between these machines. This indicates a high variance contribution to mean-squared error.
Figure 10: Average output of 100 feedforward neural networks with five hidden units each, trained on independent data sets. The regression surface is shown in Figure 3b, with the same gray-level code.

with a somewhat harder problem; we therefore replaced the digits by a new, "corrupted," training set, derived by flipping each pixel (black to white or white to black), independently, with probability 0.2. See the bottom four rows of Figure 12 for some examples. Note that this corruption does not in any sense mimic the difficulties encountered in
Figure 11: Bias and variance of single-hidden-unit and 15-hidden-unit feedforward neural networks, as functions of input vector. Regression surface is depicted in Figure 3b. Scale is by gray levels, running from largest values, coded in black, to zero, coded in white. (a) Bias of single-hidden-unit machine. (b) Variance of single-hidden-unit machine. (c) Bias of 15-hidden-unit machine. (d) Variance of 15-hidden-unit machine. Bias decreases and variance increases with the addition of hidden units.

real problems of handwritten numeral recognition; the latter are linked to variability of shape, style, stroke widths, etc. The input x is a binary 16 × 16 array. We perform no "feature extraction" or other preprocessing. The classification, or output, is coded via a 10-dimensional vector y = (y_0, ..., y_9), where y_i = +0.9 indicates the digit "i," and y_i = −0.9 indicates "not i." Each example in the (noisy) data set is paired with the correct classification vector, which has one component with value +0.9 and nine components with values −0.9.
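The corruption and coding steps just described are straightforward; a sketch in our own notation:

```python
# Pixel-flip corruption and 10-dimensional +/-0.9 coding of digit labels.
import numpy as np

rng = np.random.default_rng(6)

def corrupt(image, flip_prob=0.2):
    """Flip each pixel of a binary {0,1} image with the given probability."""
    flips = rng.random(image.shape) < flip_prob
    return np.where(flips, 1 - image, image)

def code_label(digit):
    """10-dimensional coding: +0.9 for the digit, -0.9 for "not i"."""
    y = np.full(10, -0.9)
    y[digit] = 0.9
    return y

image = rng.integers(0, 2, size=(16, 16))   # stand-in for a real numeral
noisy = corrupt(image)                      # ~20% of pixels flipped
y = code_label(5)                           # codes the digit "5"
```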
Figure 12: Top four rows: examples of handwritten numerals. Bottom four rows: same examples, corrupted by 20% flip rate (black to white or white to black).
To assess bias and variance, we set aside half of the data set (600 digits), thereby excluding these examples from training. Let us denote these excluded examples by (x_601, y_601), ..., (x_1200, y_1200), and the remaining examples by (x_1, y_1), ..., (x_600, y_600). The partition was such that each group contained 60 examples of each digit; it was otherwise random. Algorithms were trained on subsets of {(x_i, y_i)}_{i=1}^{600} and assessed on {(x_i, y_i)}_{i=601}^{1200}. Each training set D consisted of 200 examples, 20 of each digit, chosen randomly from the set {(x_i, y_i)}_{i=1}^{600}. As with the previous data set, performance statistics were collected by choosing independent training sets, D^1, ..., D^M, and forming the associated (vector-valued) estimators f(x; D^1), ..., f(x; D^M). The performances of the nearest-neighbor and Parzen-window methods were assessed by using M = 100 independent training sets. For the error-backpropagation procedure, which is much more computationally intensive, M = 50 training sets were generally used. Let us again denote by f̄(x) the average response at x over all training sets. For the calculation of bias, this average is to be compared with the
regression E[y | x]. Unlike the previous example ("artificial data"), the regression in this case is not explicitly available. Consider, however, the 600 noisy digits in the excluded set: {x_l}_{l=601}^{1200}. Even with 20% corruption, the classification of these numerals is in most cases unambiguous, as judged from the bottom rows of Figure 12. Thus, although this is not quite what we called a "degenerate" case in Section 2.1, we can approximate the regression at x_l, 601 ≤ l ≤ 1200, to be the actual classification, y_l: E[y | x_l] ≈ y_l. Of course there is no way to display visually bias and variance as functions of x, as we did with the previous data, but we can still calculate approximate integrated bias, variance, and mean-squared error, using the entire excluded set, x_601, ..., x_1200, and the associated (approximate) "regressions" y_601, ..., y_1200:
Integrated Bias:
$$\frac{1}{600}\sum_{l=601}^{1200}\left|\bar{f}(x_l) - y_l\right|^2$$

Integrated Variance:
$$\frac{1}{600}\sum_{l=601}^{1200}\frac{1}{M}\sum_{k=1}^{M}\left|f(x_l; \mathcal{D}^k) - \bar{f}(x_l)\right|^2$$

Integrated Mean-Squared Error:
$$\frac{1}{600}\sum_{l=601}^{1200}\frac{1}{M}\sum_{k=1}^{M}\left|f(x_l; \mathcal{D}^k) - y_l\right|^2$$
The last estimate (for integrated mean-squared error) is exactly the sum of the first two (for integrated bias and integrated variance). Notice that the nearest-neighbor and Parzen-window estimators are both predicated on the assignment of a distance in the input (x) space, which is here the space of 16 x 16 binary images, or, ignoring the lattice structure, simply {0, 1}^256. We used Hamming distance for both estimators. (In Section 6, we shall report on experiments using a different metric for this numeral recognition task.) The kernel for the Parzen-window experiments was the exponential: W(x) = exp{-|x|}, where |x| is the Hamming distance to the zero vector. We have already remarked on the close relation between the kernel and nearest-neighbor methods. It is, then, not surprising that the experimental results for these two methods were similar in every regard. Figures 13 and 14 show the bias and variance contributions to error, as well as the total (mean-squared) error, as functions of the respective "smoothing parameters" - the number of nearest neighbors and the kernel "bandwidth." The bias/variance tradeoff is well illustrated in both figures. As was already noted in Sections 3.5.1 and 3.5.2, either machine can be used as a classifier. In both cases, the decision rule is actually asymptotically equivalent to implementing the obvious decision function: choose that classification whose 10-component coding is closest to the machine's output. To help the reader calibrate the mean-squared-error scale in these figures, we note that the values 1 and 2 in mean-squared error correspond, roughly, to 20 and 40% error rates, respectively.
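For concreteness, the integrated bias, variance, and mean-squared error defined above can be computed directly from the collection of machine outputs. The sketch below is our own illustration; it assumes the outputs f(x_l; D^k) are stacked in a NumPy array of shape (M, L, 10), with the approximate regressions y_l in an array of shape (L, 10).

```python
import numpy as np

def integrated_errors(preds, targets):
    """preds: (M, L, 10) outputs of M machines on the L excluded inputs;
    targets: (L, 10) approximate regressions y_l.
    Returns (integrated bias, integrated variance, integrated MSE)."""
    f_bar = preds.mean(axis=0)                       # average response at each x_l
    bias = np.mean(np.sum((f_bar - targets) ** 2, axis=-1))
    var = np.mean(np.sum((preds - f_bar) ** 2, axis=-1))
    mse = np.mean(np.sum((preds - targets[None]) ** 2, axis=-1))
    return bias, var, mse                            # mse == bias + var
```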
Figure 13: Nearest-neighbor regression for handwritten numeral recognition. Bias, variance, and total error as a function of the number of neighbors.

The results of experiments with the backpropagation network are more complex. Indeed, the network's output to a given input x is not uniquely defined by the sole choice of the training set D and of a "smoothing parameter," as it is in the nearest-neighbor or the Parzen-window case. As we shall now see, convergence issues are important, and may introduce considerable variation in the behavior of the network. In the following experiments, the learning algorithm (equation 3.9) was initialized with independent and uniformly distributed weights, chosen from the interval [-0.1, 0.1]; the gain parameter, ε, was 0.1.
Figure 14: Kernel regression for handwritten numeral recognition. Bias, variance, and total error as a function of kernel bandwidth.
Figure 15 shows bias, variance, and total error, for a four-hidden-unit network, as a function of the number of iterations (on a logarithmic scale). We observe that minimum total error is achieved by stopping the training after about 100 iterations, despite the fact that the fit to the training data continues to improve, as depicted by the curve labeled "learning." Thus, even with just four hidden units, there is a danger of "overfitting," with consequent high variance. Notice indeed the steady decline in bias and increase in variance as a function of training time. This phenomenon is strikingly similar to one observed in several other applications of nonparametric statistics, such as maximum-likelihood reconstruction for emission tomography (cf. Vardi et al. 1985; Veklerov and Llacer 1987). In that application, the natural "cost functional" is the (negative) log likelihood, rather than the observed mean-squared error. Somewhat analogous to gradient descent is the "E-M" algorithm (Dempster et al. 1976, but see also Baum 1972) for iteratively increasing likelihood. The reconstruction is defined on a pixel grid whose resolution plays a role similar to the number of hidden units.
Figure 15: Neural network with four hidden units trained by error backpropagation. The curve marked “Learning” shows the mean-squared error, on the training set, as a function of the number of iterations of backpropagation [denoted ”log(time)”]. The best machine (minimum total error) is obtained after about 100 iterations; performance degrades with further training.
For sufficiently fine grids, the E-M algorithm produces progressively better reconstructions up to a point, and then decisively degrades. In both applications there are many solutions that are essentially consistent with the data, and this, in itself, contributes importantly to variance. A manifestation of the same phenomenon occurs in a simpler setting, when fitting data with a polynomial whose degree is higher than the number of points in the data set. Many different polynomials are consistent with the data, and the actual solution reached may depend critically on the algorithm used as well as on the initialization.
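The polynomial case is easy to reproduce numerically. In the sketch below (our own construction), a degree-10 polynomial is fit to 5 points; the minimum-norm least-squares solution and a second solution obtained by adding a null-space component agree exactly on the training points but differ substantially away from them.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=5)                    # 5 data points
y = np.sin(3 * x)

V = np.vander(x, N=11, increasing=True)           # 11 coefficients, 5 constraints
coef_min_norm, *_ = np.linalg.lstsq(V, y, rcond=None)

# Any null-space vector of V can be added without changing the fit on the data.
null_basis = np.linalg.svd(V)[2][5:]              # rows spanning the null space
coef_alt = coef_min_norm + 5.0 * null_basis[0]

x_test = np.linspace(-1, 1, 7)
V_test = np.vander(x_test, N=11, increasing=True)
print(np.max(np.abs(V @ (coef_alt - coef_min_norm))))       # ~0 on the data
print(np.max(np.abs(V_test @ (coef_alt - coef_min_norm))))  # large elsewhere
```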
Figure 16: Total error, bias, and variance of feedforward neural network as a function of the number of hidden units. Training is by error backpropagation. For a fixed number of hidden units, the number of iterations of the backpropagation algorithm is chosen to minimize total error.
Returning to backpropagation, we consistently found in the experiments with handwritten numerals that better results could be achieved by stopping the gradient descent well short of convergence; see, for example, Chauvin (1990) and Morgan and Bourlard (1990), who report on similar findings. Keeping in mind these observations, we have plotted, in Figure 16, bias, variance, and total mean-squared error as a function of the number of hidden units, where for each number of hidden units we chose the optimal number of learning steps (in terms of minimizing total error). Each entry is the result of 50 trials, as explained previously, with the sole exception of the last experiment. In this experiment, involving 24 hidden units, only 10 trials were used, but there was very little fluctuation around the point depicting (averaged) total error.
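In practice one cannot observe total error directly; a common surrogate, in the spirit of the stopping rule above, is to monitor error on held-out data during gradient descent and keep the best weights seen so far. A minimal sketch (our own, with hypothetical function names; the weights are assumed to be a NumPy array):

```python
def train_with_early_stopping(grad_step, loss, w0, val_data, n_iters=10_000):
    """grad_step(w) performs one iteration of gradient descent on the
    training data (e.g., one backpropagation sweep); loss(w, data) is the
    mean-squared error. Returns the weights with minimum held-out error,
    the analog of stopping after about 100 iterations in Figure 15."""
    w, best_w = w0, w0
    best_val = loss(w0, val_data)
    for _ in range(n_iters):
        w = grad_step(w)
        val = loss(w, val_data)
        if val < best_val:
            best_val, best_w = val, w.copy()   # snapshot of the best weights
    return best_w, best_val
```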
The basic trend is what we expect: bias falls and variance increases with the number of hidden units. The effects are not perfectly demonstrated (notice, for example, the dip in variance in the experiments with the largest numbers of hidden units), presumably because the phenomenon of overfitting is complicated by convergence issues and perhaps also by our decision to stop the training prematurely. The lowest achievable mean-squared error appears to be about 2.

4 Balancing Bias and Variance
This section is a brief overview of some techniques used for obtaining optimal nonparametric estimators. It divides naturally into two parts: the first deals with the finite-sample case, where the problem is to do one's best with a given training set of fixed size; the second deals with the asymptotic infinite-sample case. Not surprisingly, the first part is a review of relatively informal "recipes," whereas the second is essentially mathematical.

4.1 Automatic Smoothing. As we have seen in the previous section, nonparametric estimators are generally indexed by one or more parameters which control bias and variance; these parameters must be properly adjusted, as functions of sample size, to ensure consistency, that is, convergence of mean-squared error to zero in the large-sample-size limit. The number of neighbors k, the kernel bandwidth σ, and the number of hidden units play these roles, respectively, in nearest-neighbor, Parzen-window, and feedforward-neural-network estimators. These "smoothing parameters" typically enforce a degree of regularity (hence bias), thereby "controlling" the variance. As we shall see in Section 4.2, consistency theorems specify asymptotic rates of growth or decay of these parameters to guarantee convergence to the unknown regression, or, more generally, to the object of estimation. Thus, for example, a rate of growth of the number of neighbors or of the number of hidden units, or a rate of decay of the bandwidth, is specified as a function of the sample size N. Unfortunately, these results are of a strictly asymptotic nature, and generally provide little in the way of useful guidelines for smoothing-parameter selection when faced with a fixed and finite training set D. It is, however, usually the case that the performance of the estimator is sensitive to the degree of smoothing. This was demonstrated previously in the estimation experiments, and it is a consistent observation of practitioners of nonparametric methods. This has led to a search for "automatic" or "data-driven" smoothing: the selection of a smoothing parameter based on some function of the data itself. The most widely studied approach to automatic smoothing is "cross-validation." The idea of this technique, usually attributed to Stone (1974), is as follows. Given a training set D_N = {(x_1, y_1), ..., (x_N, y_N)} and a
"smoothing parameter" X, we denote, generically, an estimator by f(x;N,A, D N ) [see, for example, (3.6) with X = k or (3.7) with X = r r ] . Cross-validation is based on a "leave-one-out" assessment of estimation performance. Denote by V ( ' ) N1, 5 i 2 N, the data set excluding the ith observation ( x , , y ; ) : D(')N= { ( x l , y l ) , . . . , ( x i - l , y i - l ) , ( x i + ! , y i + l ) , . . ., ( x N , Y N ) } . Now fix A and form the estimator f ( x ; N - 1,X , D ( ' ) N ) , which is independent of the excluded observation ( x ; , y i ) . We can "test," or "cross-validate," on the excluded data: if f ( x , ; N -- 1,A, D$)) is close to yi, then there is some evidence in favor of f(x; N, A, D N )as an estimator of E[y I XI, at least for large N, wherein we do not expect f ( x ; N - 1,A , D i ' ) and f(x; N,A, DN) to be very different. Better still is the pooled assessment
The cross-validated smoothing parameter is the minimizer of A(λ), which we will denote by λ*. The cross-validated estimator is then f(x; N, λ*, D_N). Cross-validation is computation-intensive. In the worst case, we need to form N estimators at each value of λ, to generate A(λ), and then to find a global minimum with respect to λ. Actually, the computation can often be very much reduced by introducing closely related (sometimes even better) assessment functions, or by taking advantage of special structure in the particular function f(x; N, λ, D_N) at hand. The reader is referred to Wahba (1984, 1985) and O'Sullivan and Wahba (1985) for various generalizations of cross-validation, as well as for support of the method in the way of analytic arguments and experiments with numerous applications. In fact, there is now a large statistical literature on cross-validation and related methods (Scott and Terrell 1987; Hardle et al. 1988; Marron 1988; Faraway and Jhun 1990; Hardle 1990 are some recent examples), and there have been several papers in the neural network literature as well - see White (1988, 1990), Hansen and Salamon (1990), and Morgan and Bourlard (1990). Computational issues aside, the resulting estimator, f(x; N, λ*, D_N), is often strikingly effective, although some "pathological" behaviors have been pointed out (see, for example, Schuster and Gregory 1981). In general, theoretical underpinnings of cross-validation remain weak, at least as compared to the rather complete existing theory of consistency for the original (not cross-validated) estimators. Other mechanisms have been introduced with the same basic design goal: prevent overfitting and the consequent high contribution of variance to mean-squared error (see, for example, Mozer and Smolensky 1989 and Karnin 1990 for some "pruning" methods for neural networks). Most of these other methods fall into the Bayesian paradigm, or the closely related method of regularization. In the Bayesian approach, likely regularities are specified analytically, and a priori. These are captured in a prior probability distribution, placed on a space of allowable input-to-output mappings ("machines"). It is reasonable to hope that estimators then
derived through a proper Bayesian analysis would be consistent; there should be no further need to control variance, since the smoothing, as it were, has been imposed a priori. Unfortunately, in the nonparametric case it is necessary to introduce a distribution on the infinite-dimensional space of allowable mappings, and this often involves serious analytical, not to mention computational, problems. In fact, analytical studies have led to somewhat surprising findings about consistency or the lack thereof (see Diaconis and Freedman 1986). Regularization methods rely on a "penalty function," which is added to the observed sum of squared errors and promotes (is minimum at) "smooth," or "parsimonious," or otherwise "regular" mappings. Minimization can then be performed over a broad, possibly even infinite-dimensional, class of machines; a properly chosen and properly scaled penalty function should prevent overfitting. Regularization is very similar to, and sometimes equivalent to, Bayesian estimation under a prior distribution that is essentially the exponential of the (negative) penalty function. There has been much said about choosing the "right" penalty function, and attempts have been made to derive, logically, information-based measures of machine complexity from "first principles" (see Akaike 1973; Rissanen 1986). Regularization methods, complexity-based as well as otherwise, have been introduced for neural networks, and both analytical and experimental studies have been reported (see, for example, Barron 1991; Chauvin 1990).
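A minimal instance of this scheme is penalized least squares with a squared-norm (weight-decay) penalty on a linear machine, for which the minimizer is available in closed form; this is our own illustration of the generic idea, not a method from the text.

```python
import numpy as np

def penalized_least_squares(X, y, penalty):
    """Minimize  sum_i (y_i - x_i . w)^2 + penalty * |w|^2.
    The penalty promotes small-norm ("regular") mappings, trading a
    little bias for a reduction in variance."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(d), X.T @ y)
```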
4.2 Consistency and Vapnik-Červonenkis Dimensionality. The study of neural networks in recent years has involved increasingly sophisticated mathematics (cf. Barron and Barron 1988; Barron 1989; Baum and Haussler 1989; Haussler 1989b; White 1989, 1990; Amari 1990; Amari et al. 1990; Azencott 1990; Baum 1990a), often directly connected with the statistical-inference issues discussed in the previous sections. In particular, machinery developed for the analysis of nonparametric estimation in statistics has been heavily exploited (and sometimes improved on) for the study of certain neural network algorithms, especially least-squares algorithms for feedforward neural networks. A reader unfamiliar with the mathematical tools may find this more technical literature unapproachable. He may, however, benefit from a somewhat heuristic derivation of a typical (and, in fact, much-studied) result: the consistency of least-squares feedforward networks for arbitrary regression functions. This is the purpose of the present section: rather than a completely rigorous account of the consistency result, the steps below provide an outline, or plan of attack, for a proper proof. It is in fact quite easy, if somewhat laborious, to fill in the details and arrive at a rigorous result. The nontechnically oriented reader may skip this section without missing much of the more general points that we shall make in Sections 5 and 6.

Previously, we have ignored the distinction between a random variable on the one hand, and an actual value that might be obtained on making an observation of the random variable on the other hand. In this discussion of consistency, we will be more careful and adopt the usual convention of denoting random variables by upper-case letters, and their values by the corresponding lower-case letters. In the general regression problem, there are two random vectors, X and Y, which we might think of as the argument and corresponding value of some unknown, and possibly random, function. We observe N independent samples with values D_N = {(x_1, y_1), ..., (x_N, y_N)}. Based on this "training set," we wish to learn to accurately predict Y from the "input" X. Because there is nothing special about the mathematics of learning a vector relation per se, we will simplify our discussion by treating the vectors X and Y as scalars X and Y. We will continue to use mean-squared error,

$$E\left[(Y - f(X))^2\right]$$
to evaluate the accuracy of a function f as a predictor of Y. We recall (see 2.2) that
$$E\left[(Y - f(X))^2\right] = E\left[(f(X) - E[Y \mid X])^2\right] + E\left[(Y - E[Y \mid X])^2\right] \tag{4.1}$$
where E[·] means expectation (averaging) with respect to the joint distribution on X and Y. Since the second term of the right-hand side does not involve f, we will, as usual, adopt E[(f(X) - E[Y | X])^2] in evaluating f as a predictor of Y from X. The actual estimator is drawn from some class of functions that we will denote by F_M. The primary example that we have in mind is a class of feedforward networks with M hidden units. Depending on details about the distribution of X and Y, and about the architecture of machines in F_M, it may be necessary to restrict the magnitudes of the "synaptic weights," for example, |w_ij| ≤ β_M for all weights {w_ij}, where β_M ↑ ∞ as the number of hidden units, M, is increased to infinity. Given M and a training set D_N of size N, we define now our estimator f(x; N, M, D_N) to be the best fit to the data within the class F_M:
$$f(x; N, M, \mathcal{D}_N) = \arg\min_{f \in \mathcal{F}_M} \frac{1}{N}\sum_{i=1}^{N}\left[y_i - f(x_i)\right]^2 \tag{4.2}$$
Of course, actually getting this solution may be a very difficult, perhaps even intractable, computational problem. Our discussion of consistency necessarily concerns the true minimizer of 4.2, rather than the actual output of an algorithm designed to minimize 4.2, such as error backpropagation. In practice, there are serious convergence issues for such algorithms, with, unfortunately, very few analytical tools to address them. Also, it may be that 4.2 has multiple global minima. This is only a technical complication, as we could replace f(x; N, M, D_N), in what follows, by the set of minimizers. We shall therefore assume that the minimization is unique.
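To fix ideas, the sketch below approximates the minimization in 4.2 for a class F_M of one-hidden-layer tanh networks by plain gradient descent (scalar input and output; all details are our own simplifications). As just noted, such an algorithm may only reach a local minimum of 4.2.

```python
import numpy as np

def fit_in_class(X, y, M, n_iters=5000, lr=0.01, seed=0):
    """Gradient descent on (1/2N) sum_i [y_i - f(x_i)]^2 over the class
    F_M of one-hidden-layer nets f(x) = sum_m c_m tanh(a_m x + b_m)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0, 0.1, M)                  # input weights
    b = rng.normal(0, 0.1, M)                  # hidden biases
    c = rng.normal(0, 0.1, M)                  # output weights
    N = len(X)
    for _ in range(n_iters):
        H = np.tanh(np.outer(X, a) + b)        # N x M hidden activations
        r = H @ c - y                          # residuals
        dH = (1 - H ** 2) * np.outer(r, c)     # backprop through tanh
        a -= lr * (dH * X[:, None]).sum(axis=0) / N
        b -= lr * dH.sum(axis=0) / N
        c -= lr * (H.T @ r) / N
    return lambda x: np.tanh(np.outer(np.atleast_1d(x), a) + b) @ c
```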
For large N, we would hope to find f(x; N, M, D_N) "close" to E[Y | x]. Let us denote by f_M the best that can be done within the family F_M:
$$f_M = \arg\min_{f \in \mathcal{F}_M} E\left[(f(X) - E[Y \mid X])^2\right] \tag{4.3}$$
Of course, if F_M is "too small," or at least if E[Y | x] is not well approximated by any element in F_M, then f(x; N, M, D_N) cannot be a very good estimator of the regression. But later we will make F_M "large" by taking M ↑ ∞. For now, let us compare f(x; N, M, D_N) to f_M, the best solution available in F_M. We will argue that, under appropriate conditions, f(x; N, M, D_N) is essentially as good as f_M(x). Notice that for any fixed f ∈ F_M, the N numbers [y_i - f(x_i)]^2, 1 ≤ i ≤ N, are independent and identically distributed ("i.i.d.") observations of the random variable [Y - f(X)]^2. Therefore, we expect that (1/N) Σ_i [y_i - f(x_i)]^2 is well approximated by E[(Y - f(X))^2] (the "law of large numbers," or "LLN"). With this in mind, we can proceed to bound the mean-squared error of f(X; N, M, D_N):
$$\begin{aligned}
E\left[(f(X; N, M, \mathcal{D}_N) - E[Y \mid X])^2\right]
&= E\left[(Y - f(X; N, M, \mathcal{D}_N))^2\right] - E\left[(Y - E[Y \mid X])^2\right] && \text{(as before - see 4.1)} \\
&\approx \frac{1}{N}\sum_{i=1}^{N}\left[y_i - f(x_i; N, M, \mathcal{D}_N)\right]^2 - E\left[(Y - E[Y \mid X])^2\right] && \text{(LLN)} \\
&\le \frac{1}{N}\sum_{i=1}^{N}\left[y_i - f_M(x_i)\right]^2 - E\left[(Y - E[Y \mid X])^2\right] && \text{(by defn. - see 4.2)} \\
&\approx E\left[(Y - f_M(X))^2\right] - E\left[(Y - E[Y \mid X])^2\right] && \text{(LLN)} \\
&= E\left[(f_M(X) - E[Y \mid X])^2\right] && \text{(again, see 4.1)} \\
&= \min_{f \in \mathcal{F}_M} E\left[(f(X) - E[Y \mid X])^2\right]
\end{aligned}$$
We thus obtained the desired result: f(x; N, M, D_N) is asymptotically optimal in the class F_M. Although this reasoning is essentially correct, we are still two fairly big steps away from a rigorous consistency argument. The first gap to be filled in has to do with the bias of the function f_M: E[(f_M(X) - E[Y | X])^2] is not likely to be zero, hence we have no reason to believe that, even as N → ∞, E[(f(X; N, M, D_N) - E[Y | X])^2] → 0. In other words, since F_M may not (probably does not) contain E[Y | x], there is a residual bias. This is remedied by adding more hidden units: we take M = M_N ↑ ∞ as N → ∞. The reason why this indeed eliminates residual bias is that the class F_M is asymptotically (M → ∞) dense in the space of all "reasonable" functions. (This is an often-quoted result in the neural-modeling literature. Of course it depends on details about architecture
and the particular "neuronal response function" used - see Barron 1989; Cybenko 1989; Funahashi 1989; Hornik et al. 1989; Hartman et al. 1990.) That is to say, for any (measurable) E[Y | x], there exists a sequence g_M ∈ F_M such that E[(g_M(X) - E[Y | X])^2] → 0 as M → ∞. In particular, the sequence f_M defined in 4.3 will have this property. The second problem is more difficult to solve, and is moreover confounded by the evident necessity of taking M = M_N ↑ ∞. The difficulty lies in our (first) use of the law of large numbers. It is true that
f(xi)]’ -+ E [ (Yas N + 00, for fixed functions f E FM.This is so because, as we have noted, [yi - f(xi)]’ are i.i.d. realizations of a random variable. However, the function f ( x ;N, M , D N ) depends on all of the xi’s and yi’s; therefore, the numbers { (yi - f(xi;N , M , DN))2}are tzof i.i.d. observations. They are coupled by the minimization procedure that defines f ( x ; N , M , DN), and this coupling introduces dependence. One rather crude solution consists in proving a uniform law of large numbers: (4.4) Then, in particular, lim N-m
$$\lim_{N \to \infty} \left( \frac{1}{N}\sum_{i=1}^{N}\left[y_i - f(x_i; N, M, \mathcal{D}_N)\right]^2 - E\left[\left(Y - f(X; N, M, \mathcal{D}_N)\right)^2\right] \right) = 0$$
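This coupling is easy to exhibit numerically. In the simulation below (our own, with a polynomial class standing in for F_M), the training-set average of the squared errors of the fitted function systematically underestimates its true risk, estimated on fresh data:

```python
import numpy as np

rng = np.random.default_rng(2)

def one_trial(N=20, degree=8, n_test=10_000):
    X = rng.uniform(-1, 1, N)
    Y = np.sin(2 * X) + rng.normal(0, 0.3, N)        # noisy training data
    coef = np.polyfit(X, Y, degree)                  # least-squares fit in the class
    train_err = np.mean((np.polyval(coef, X) - Y) ** 2)
    Xt = rng.uniform(-1, 1, n_test)                  # fresh data from the same law
    Yt = np.sin(2 * Xt) + rng.normal(0, 0.3, n_test)
    test_err = np.mean((np.polyval(coef, Xt) - Yt) ** 2)
    return train_err, test_err

# Averaged over repetitions, train_err falls well below test_err: the fitted
# f is coupled to the training sample, so the ordinary LLN does not apply.
print(np.mean([one_trial() for _ in range(100)], axis=0))
```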
Recall however that we must take M = M_N ↑ ∞ to eliminate any residual bias. Therefore, what we actually need is a result like

$$\lim_{N \to \infty} \sup_{f \in \mathcal{F}_{M_N}} \left| \frac{1}{N}\sum_{i=1}^{N}\left[y_i - f(x_i)\right]^2 - E\left[(Y - f(X))^2\right] \right| = 0 \tag{4.5}$$
In most cases, the class of machines, F_{M_N}, will be increasing with N, so 4.5 is stronger than 4.4. But it is not much more difficult to prove. In fact, it actually follows from 4.4, provided that 4.4 can be established along with bounds on the rate of convergence. One then specifies a sequence M = M_N, increasing to infinity sufficiently slowly, so that 4.5 is true as well. We will forgo the details of making this extension (from 4.4 to 4.5), and concentrate instead on the fundamental problem of establishing a uniform LLN, as in 4.4. Recall that for every element f ∈ F_M we have (1/N) Σ_{i=1}^N [y_i - f(x_i)]^2 → E[(Y - f(X))^2] as N → ∞, by the ordinary LLN. In this case, moreover, we know essentially everything there is to know about the rate of convergence. One way to prove 4.4 is then to cover F_M with "small balls" (see, for example, Grenander 1981; Geman and Hwang 1982). We judiciously
choose f_1, f_2, ..., f_L ∈ F_M such that every other f ∈ F_M is close to one of these (inside one of L small balls centered at the f_i's). Since convergence for each f_i, 1 ≤ i ≤ L, always guarantees uniform convergence for the finite set {f_i}_{i=1}^L, and since all other f are close to one of these f_i's, 4.4 is shown to be "nearly true." Finally, taking L → ∞ (so that the balls can get smaller), 4.4 can be rigorously established. The modern approach to the problem of proving 4.4 and 4.5 is to use the Vapnik-Červonenkis dimension. Although this approach is technically different, it proceeds with the same spirit as the method outlined above. Evidently, the smaller the set F_M, the easier it is to establish 4.4, and in fact, the faster the convergence. This is a direct consequence of the argument put forward in the previous paragraph. The Vapnik-Červonenkis approach "automates" this statement by assigning a size, or dimension, to a class of functions. In this case we would calculate, or at least bound, the size or Vapnik-Červonenkis dimension of the class of functions of (x, y) given by
$$\left\{ (y - f(x))^2 : f \in \mathcal{F}_M \right\} \tag{4.6}$$
For a precise technical definition of Vapnik-Červonenkis dimension (and some generalizations), as well as for demonstrations of its utility in establishing uniform convergence results, the reader is referred to Vapnik (1982), Pollard (1984), Dudley (1987), and Haussler (1989a). Putting aside the details, the important point here is that the definition can be used constructively to measure the size of a class of functions, such as the one defined in 4.6. The power of this approach stems from generic results about the rate of uniform convergence (see, e.g., 4.4), as a function of the Vapnik-Červonenkis dimension of the corresponding function class (see, e.g., 4.6). One thereby obtains the desired bounds for 4.4, and, as discussed above, these are rather easily extended to 4.5 by judicious choice of M = M_N ↑ ∞. Unfortunately, the actual numbers that come out of analytical arguments such as these are generally discouraging: the numbers of samples needed to guarantee accurate estimation are prohibitively large. (See, for example, the excellent paper by Haussler 1989b, in which explicit upper bounds on sample sizes are derived for a variety of feedforward networks.) Of course these analytical arguments are of a general nature. They are not dedicated to particular estimators or estimation problems, and therefore are not "tight"; there may be some room for improvement. But the need for large sample sizes is already dictated by the fact that we assume essentially no a priori information about the regression E[Y | x]. It is because of this assumption that we require uniform convergence results, making this a kind of "worst case analysis." This is just another view of the dilemma: if we have a priori information about E[Y | x] then we can employ small sets F_M and achieve fast convergence, albeit at the risk of large bias, should E[Y | x] in fact be far from F_M.
We end this section with a summary of the consistency argument:

Step 1. Check that the class F_M is rich enough, that is, show that the sequence f_M (see 4.3) is such that E[(f_M(X) - E[Y | X])^2] → 0 as M → ∞.

Step 2. Establish a uniform LLN:

$$\sup_{f \in \mathcal{F}_M} \left| \frac{1}{N}\sum_{i=1}^{N}\left[y_i - f(x_i)\right]^2 - E\left[(Y - f(X))^2\right] \right| \to 0 \quad \text{as } N \to \infty \tag{4.7}$$
together with a (probabilistic) rate of convergence (e.g., with the help of Vapnik-Červonenkis dimensionality).

Step 3. Choose M = M_N ↑ ∞ sufficiently slowly that 4.7 is still true with M replaced by M_N.
Step 4. Put together the pieces: by Steps 2 and 3,

$$E\left[(f(X; N, M_N, \mathcal{D}_N) - E[Y \mid X])^2\right] \approx E\left[(f_{M_N}(X) - E[Y \mid X])^2\right] \to 0 \quad \text{as } N \to \infty$$

by Step 1.
5 Interpretation and Relevance to Neural Networks

Let us briefly summarize the points made thus far. We first remarked that the goals one is trying to achieve with artificial neural networks, particularly of the feedforward layered type (multilayer perceptrons), generally match those of statistical inference: the training algorithms used to adjust the weights of such networks - for example, the error-backpropagation algorithm - can be interpreted as algorithms for statistical inference. We further mentioned that although learning precisely consists of adjusting a collection of parameters, namely the "synaptic weights" of the network,
such networks with their associated learning algorithms really belong to the class of nonparametric inference schemes, also called model-free methods. Nonparametric methods may be characterized by the property of consistency: in the appropriate asymptotic limit they achieve the best possible performance for any learning task given to them, however difficult this task may be. We saw that in many tasks performance is adequately measured by mean-squared error, and that optimal performance is achieved by the conditional mean, or "regression," of the output on the input: this is, among all possible functions, the one that minimizes mean-squared error. We also saw that mean-squared error can be decomposed into a bias term and a variance term. Both have to be made small if we want to achieve good performance. The practical issue is then the following: Can we hope to make both bias and variance "small," with "reasonably" sized training sets, in "interesting" problems, using nonparametric inference algorithms such as nearest-neighbor, CART, feedforward networks, etc.? It is one of the main goals of the present paper to provide an answer to this question, and we shall return to it shortly. Let us, however, immediately emphasize that the issue is purely about sample size, and quite distinct from implementation considerations such as the speed of the hardware, the parallel versus serial or analog versus digital type of machine, or the number of iterations required to achieve convergence in the case of the backpropagation algorithm.

5.1 Neural Nets and Nonparametric Inference for Machine Learning and Machine Perception Tasks. We mentioned that the focus of most connectionist models is the problem of inferring relationships among a set of observable variables, from a collection of examples called a training set. This is also the focus of the statistical sciences, so it is not surprising that statistical tools are increasingly exploited in the development and analysis of these kinds of neural models (Lippmann 1987; Barron and Barron 1988; Gallinari et al. 1988; Barron 1989; Haussler 1989a; Tishby et al. 1989; White 1989; Amari et al. 1990; Baum 1990b; Hinton and Nowlan 1990). Thus, the perceptron (Rosenblatt 1962) and other adaptive pattern classifiers (e.g., Amari 1967) are machines for computing linear decision boundaries; the "Brain State in a Box" model of categorical perception (Anderson et al. 1977) is related to factor analysis; Boltzmann Machines (Ackley et al. 1985; Hinton and Sejnowski 1986) compute (approximate) maximum-likelihood density estimators; and backpropagation networks realize an algorithm for nonparametric least-squares regression. Backpropagation networks can also be trained to achieve transformations related to principal-component analysis (Bourlard and Kamp 1988; Baldi and Hornik 1989). A good state-of-the-art statistical method for high-dimensional data analysis is projection pursuit (Friedman and Stuetzle 1981; Huber 1985). It may then be a good strategy to start from a statistical method such as this and to look for neural-like realizations of it,
thereby suggesting efficient parallel, and possibly even ultrafast, analog, implementations. Examples of networks based upon projection pursuit can be found in Intrator (1990) and Maechler et al. (1990). Modern nonparametric statistical methods, and hence many recent neural models, are important tools for wide-ranging applications. Two rather natural applications were discussed in Section 2: the General Motors foam casting problem and the problem of evaluating loan applications. There are no doubt many other applications in economics, medicine, and more generally in modern data analysis. Nevertheless, the enthusiasm over neural modeling is mostly about different kinds of problems. Indeed, anybody remotely connected to the field knows, if only from his or her mail, that much more is expected from neural networks than making additions to the statistician's toolbox. The industrial and military scientists, and to some extent academia, are poised for the advances in machine intelligence that were once anticipated from artificial intelligence. There is, for example, much enthusiasm anticipating important advances in automatic target recognition, and more generally in invariant object recognition. In speech processing there have been successes in isolated phoneme recognition (Waibel et al. 1988; Lippmann 1989), and there is a suggestion of neural networks (or other nonparametric methods) as good "front ends" for hidden Markov models (Bourlard and Wellekens 1990), and, beyond this, of advances in continuous speech recognition via trained neural networks, avoiding the difficult task of estimating parameters in complex hidden Markov models. Further, there is the hope of building expert systems without "engineering knowledge": neural networks can learn by example. In this regard, evaluating loans is indeed a modest start. Typical applications of expert systems would involve many more variables, and would need to predict more than just a small number of possible outcomes. Finally, it should not be forgotten that from their inception neural networks were also, if not chiefly, meant to contribute to the understanding of real brains. The debate about their adequacy as models of cognition is probably more intense now than ever (Fodor and Pylyshyn 1988; Smolensky 1988). From at least one point of view the optimism about neural networks is well-founded. The consistency theorems mentioned in Sections 2 and 4 guarantee a (suitably formulated) optimal asymptotic performance. Layered networks, Boltzmann Machines, and older methods like nearest-neighbor or window estimators, can indeed form the basis of a trainable, "from scratch," speech recognition system, or a device for invariant object recognition. With enough examples and enough computing power, the performance of these machines will necessarily approximate the best possible for the task at hand. There would be no need for preprocessing or devising special representations: the "raw data" would do. Is this hope indeed realistic? Also, is there any reason to believe that neural networks will show better performances than other nonparametric
methods with regard to difficult problems that are deemed to require some form of "intelligence"? As we have seen, the question really boils down to the following: Can training samples be large enough to make both bias and variance small? To get a feeling about this issue, consider for a moment the problem of recognizing all nonambiguous handwritten characters. This is somewhat ill-defined, but we mean, roughly, the following. The input X is a digitized raw image of a single segmented character, handwritten, drawn, or etched, using any kind of tool or process, on any kind of material. The distribution of inputs includes various styles of writing, various sizes, positions, and orientations of the character in the image, various widths of stroke, various lighting conditions, various textures, shadings, etc. Images may moreover include substantial "noise" and degradations of various sorts. It is assumed that in spite of the variability of the data, the conditional distribution P(Y | X) is degenerate for all X: a human observer provides a perfectly nonambiguous labeling Y for any X drawn from this distribution. By definition, an optimal classifier for this task achieves zero mean-squared error, since the labeling Y is a deterministic function of the input X. This general problem is certainly more difficult than the hardest of character recognition problems actually solved today, for example, by neural methods (cf. Le Cun et al. 1989; Martin and Pittman 1991). On the other hand, insofar as this problem is well defined, consistency theorems apply to it and guarantee optimal performance, in this case zero-error classification. One should thus be able to devise a sequence of machines that would, in the limit, when trained on enough data drawn from the given distribution of (X, Y), perform the task just as accurately as the human observer, that is, never fail, since we assumed nonambiguous labeling. In reality, the reason why we are still quite far from building optimal performance machines for this problem is the wide gap between the theoretical notion of consistency - an asymptotic property - and conditions realized in practice. As we have seen in Section 4, consistency requires that the size of the training set grow to infinity, and that the algorithm simultaneously adapt itself to larger and larger training sets. In essence, the machine should become more and more versatile, that is, eliminate all biases. On the other hand, elimination of bias should not happen too fast, lest the machine become dedicated to the idiosyncrasies of the training examples. Indeed, we have seen that for any finite-size training set the price to pay for low bias is high variance. In most cases, as we have seen in Section 3, there is a "smoothing parameter" whose value can be adjusted to achieve the very best bias/variance tradeoff for any fixed size of the training set. However, even with the best compromise, an estimator can still be quite far from optimal. Only when the size of the training set literally grows to infinity can one eliminate at the same
time both bias and variance. This justifies the term "dilemma," and the consequence is prohibitively slow convergence. In practice, the size of the training set for our "general" character recognition problem will always be considerably smaller than would be required for any nonparametric classification scheme to meaningfully approximate optimality. In other words, for complex perceptual tasks such as this, a "sufficiently large training set" exists only in theory.

5.2 Interpolation and Extrapolation. The reader will have guessed by now that if we were pressed to give a yes/no answer to the question posed at the beginning of this chapter, namely: "Can we hope to make both bias and variance 'small,' with 'reasonably' sized training sets, in 'interesting' problems, using nonparametric inference algorithms?" the answer would be no rather than yes. This is a straightforward consequence of the bias/variance "dilemma." Another way to look at this stringent limitation is that if a difficult classification task is indeed to be learned from examples by a general-purpose machine that takes as inputs raw unprocessed data, then this machine will have to "extrapolate," that is, generalize in a very nontrivial sense, since the training data will never "cover" the space of all possible inputs. The question then becomes: What sorts of rules do conventional algorithms follow when faced with examples not suitably represented in the training set? Although this is dependent on the machine or algorithm, one may expect that, in general, extrapolation will be made by "continuity," or "parsimony." This is, in most cases of interest, not enough to guarantee the desired behavior. For instance, consider again the sinusoid-within-rectangle problem discussed in Section 3. Suppose that after training the machine with examples drawn within the rectangle, we ask it to extrapolate its "knowledge" to other regions of the plane. In particular, we may be interested in points of the plane lying far to the left or to the right of the rectangle. If, for instance, the k-nearest-neighbor scheme is used, and if both k and the size of the training set are small, then it is fairly easy to see that the extrapolated decision boundary will be very sensitive to the location of training points at the far extremes of the rectangle. This high variance will decrease as k and the sample size increase: eventually, the extrapolated decision boundary will stabilize around the horizontal axis. Other schemes such as Parzen windows and layered networks will show similar behavior, although the details will differ somewhat. At any rate, it can be seen from this example that extrapolation may be to a large extent arbitrary. Using still the same example, it may be the case that the number of training data is too small for even a good interpolation to take place. This will happen inevitably if the size of the training sample is kept constant while the number of periods of the sinusoid in the rectangle, that is, the complexity of the task, is increased. Such a learning task will defeat general nonparametric schemes. In reality, the problem has now become
extrapolation rather than interpolation, and there is no a priori reason to expect the right type of extrapolation. One recourse is to build in expectations: in this particular case, one may favor periodic-type solutions, for instance by using estimators based on Fourier series along the x-axis. Evidently, we are once more facing the bias/variance dilemma: without anticipating structure and thereby introducing bias, one should be prepared to observe substantial dependency on the training data. If the problem at hand is a complex one, such as the high-frequency two-dimensional sinusoid, training samples of reasonable size will never adequately cover the space, and, in fact, which parts are actually covered will be highly dependent on the particular training sample. The situation is similar in many real-world vision problems, due to the high dimensionality of the input space. This may be viewed as a manifestation of what has been termed the "curse of dimensionality" by Bellman (1961). The fundamental limitations resulting from the bias/variance dilemma apply to all nonparametric inference methods, including neural networks. This is worth emphasizing, as neural networks have given rise in the last decade to high expectations and some surprising claims. Historically, the enthusiasm about backpropagation networks stemmed from the claim that the discovery of this technique allowed one, at long last, to overcome the fundamental limitation of its ancestor the perceptron, namely the inability to solve the "credit (or blame) assignment problem." The hope that neural networks, and in particular backpropagation networks, will show better generalization abilities than other inference methods, by being able to develop clever "internal representations," is implicit in much of the work about neural networks. It is indeed felt by many investigators that hidden layers, being able to implement any nonlinear transformation of the input space, will use this ability to "abstract the regularities" from the environment, thereby solving problems otherwise impossible or very difficult to solve. In reality, the hidden units in a layered network are a nonlinear device that can be used to achieve consistency, like many others. There would seem to be no reason to expect sigmoid functions with adaptive weights to do a significantly better job than other nonlinear devices, such as, for example, gaussian kernels or the radial basis functions discussed by Poggio and Girosi (1990). Consistency is an asymptotic property shared by all nonparametric methods, and it teaches us all too little about how to solve difficult practical problems. It does not help us out of the bias/variance dilemma for finite-size training sets. Equivalently, it becomes irrelevant as soon as we deal with extrapolation rather than interpolation. Unfortunately, the most interesting problems tend to be problems of extrapolation, that is, nontrivial generalization. It would appear, then, that the only way to avoid having to densely cover the input space with training examples - which is unfeasible in practice - is to prewire the important generalizations.
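As a one-dimensional caricature of this recourse (our own construction; the problem in the text is two-dimensional), a least-squares fit in a Fourier basis extrapolates periodically far outside the training region, where a generic smoother would have no reason to do so:

```python
import numpy as np

def fourier_features(x, n_freq=5, period=2 * np.pi):
    """Design matrix of sines and cosines along the x-axis."""
    k = np.arange(1, n_freq + 1) * 2 * np.pi / period
    return np.hstack([np.ones((len(x), 1)),
                      np.cos(np.outer(x, k)),
                      np.sin(np.outer(x, k))])

x_train = np.linspace(0, 4 * np.pi, 60)            # training region only
y_train = np.sign(np.sin(x_train))                 # +/-1 class labels
w, *_ = np.linalg.lstsq(fourier_features(x_train), y_train, rcond=None)

# Far to the right of the training region, the built-in periodicity
# dictates the extrapolated decision function.
x_far = np.linspace(20 * np.pi, 24 * np.pi, 9)
print(np.sign(fourier_features(x_far) @ w))
```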
In light of these rather pessimistic remarks, one is reminded of our earlier discussion (see Section 2.3) of some successful applications of nonparametric methods. Recall that General Motors, for example, made an important reduction in the scrap rate of Styrofoam castings after building a nonparametric classifier based on the CART procedure. The input, or feature, space comprised 80 process variables. It was not reasonable to hope that the 470 training examples would meaningfully cover the potentially achievable settings of these variables. Certainly, extrapolation to regions not represented would have been hazardous, at least without a believable model for the relationship between castings and process variables. But this was not a problem of extrapolation; the goal was not to learn the relationship between casting success and process variables per se, but rather to identify an achievable range of process variables that would ensure a high likelihood of good castings. With this more modest goal in mind, it was not unreasonable to anticipate that a data set with substantial variation in the settings of the process variables would help locate regions of high likelihood of success. Also discussed in Section 2.3 was the application of a neural network learning system to risk evaluation for loans. In contrast to the Styrofoam casting problem, there is here the luxury of a favorable ratio of training-set size to dimensionality. Records of many thousands of successful and defaulted loans can be used to estimate the relation between the 20 or so variables characterizing the applicant and the probability of his or her repaying a loan. This rather uncommon circumstance favors a nonparametric method, especially given the absence of a well-founded theoretical model for the likelihood of a defaulted loan.

6 Designing Bias

If, as we have seen in the previous chapter, the asymptotic property of consistency does not help us much in devising practical solutions to the more substantial problems of machine learning and machine perception, how could one improve on the capabilities of existing algorithms? It is sometimes argued that the brain is a proof of existence of near-optimal methods that do not require prohibitively large training samples. Indeed, in many cases, we do learn with remarkable speed and reliability. Language acquisition offers a striking example: children often memorize new words after hearing them just once or twice. Such "one-shot" learning has apparently little to do with statistical inference. Without going to such extremes, does not the simple observation that quick and reliable perceptual learning exists in living brains contradict the conclusions of the previous chapter? The answer is that the bias/variance dilemma can be circumvented if one is willing to give up generality, that is, purposefully introduce bias. In this way, variance can be eliminated, or significantly reduced. Of
course, one must ensure that the bias is in fact harmless for the problem at hand: the particular class of functions from which the estimator is to be drawn should still contain the regression function, or at least a good approximation thereof. The bias will then be harmless in the sense that it will contribute significantly to mean-squared error only if we should attempt to infer regressions that are not in the anticipated class. In essence, bias needs to be designed for each particular problem. For a discussion of this point in a psychological perspective, and some proposals for specific regularities that living brains may be exploiting when making nontrivial generalizations, see Shepard (1989). Similar suggestions have been made by several other authors in the specific context of neural networks (cf. Anderson et al. 1990). Indeed, it has been found that for many problems a constrained architecture can do better than a general-purpose one (Denker et al. 1987; Waibel et al. 1988; Le Cun et al. 1989; Solla 1989). This observation has a natural explanation in terms of bias and variance: in principle, the synaptic weights in a "generalist" neural network should converge to a satisfactory solution if such a solution exists, yet in practice this may be unfeasible, as a prohibitive number of examples are required to control the variance (not to mention the computational problem of identifying good minima in "weight space"). In some cases, a set of simple constraints on the architecture, that is, a bias, will essentially eliminate the variance, without at the same time eliminating the solution from the family of functions generated by the network. A simple example of such a situation is the so-called contiguity problem (Denker et al. 1987; Solla 1989). In a statistical physics perspective, introducing bias may also be viewed as a means of decreasing an appropriately defined measure of entropy of the machine (Carnevali et al. 1987; Denker et al. 1987; Tishby et al. 1989; Schwartz et al. 1990). In many cases of interest, one could go so far as to say that designing the right biases amounts to solving the problem. If, for example, one could prewire an invariant representation of objects, then the burden of learning complex decision boundaries would be reduced to one of merely storing a label. Such a machine would indeed be very biased; it would, in fact, be incapable of distinguishing among the various possible presentations of an object, up-side-up versus up-side-down, for example. This is, then, perhaps somewhat extreme, but the bias/variance dilemma suggests to us that strong a priori representations are unavoidable. Needless to say, the design of such representations, and other biases that may be essential, for example, to auditory perception or to other cognitive tasks, is a formidable problem. Unfortunately, such designs would appear to be much more to the point, in their relevance to real brains, than the study of nonparametric inference, whether neurally inspired or not. This suggests that the paradigm of near tabula rasa learning (i.e., essentially unbiased learning), which has been so much emphasized in
the neural-computing literature of the last decade, may be of relatively minor biological importance. It may still be a good idea, for example, for the engineer who wants to solve a task in machine perception, to look for inspiration in living brains. In the best of all cases, this could allow him or her to discover the nature of the biases "internalized" during the course of phylogenetic and ontogenetic evolution. However, the hope that current connectionist networks already inherit the biases of real brains from the mere fact that they are built from "brain-like" processing elements seems farfetched, to say the least. Indeed, one could reasonably question the extent to which connectionist networks adequately reflect the basic principles of anatomy and physiology of living brains (see, e.g., Crick 1989).

6.1 Further Experiments with Handwritten Numerals. We have performed some further experiments on handwritten-numeral recognition for the purpose of illustrating the possible advantages of forgoing generality in a challenging inference problem, and concentrating, instead, on the design of appropriate bias. These experiments were inspired by a theory of neural coding (von der Malsburg 1981, 1986; von der Malsburg and Bienenstock 1986) that emphasizes the role of accurate temporal correlations across neuronal populations. This theory leads to an alternative notion of distance between patterns (sensory stimuli) that accommodates, a priori, much of the invariance that would otherwise need to be learned from examples (Bienenstock and Doursat 1991). In brief, von der Malsburg and Bienenstock argue that living brains, exploiting the fine temporal structure of neural activity, are well suited to the task of finding near-optimal, that is, topology-preserving, maps between pairs of labeled graphs. In a simple example, such graphs could be the nearest-neighbor, black-white, graphs defined by the 16 x 16 binary character arrays used in our handwritten-numeral experiments. These map computations give rise to a natural metric, which measures the extent to which one pattern needs to be distorted to match a second pattern. Shifts, for example, cost nothing, and two patterns that differ only by a shift are therefore deemed to be zero-distance apart. Small distortions, such as a stroke extending at a slightly different angle in one character than in another, incur only small costs: the distance between two such patterns is small. A very similar notion of distance has arisen in computer vision, via so-called deformable templates, for the purpose of object recognition (see Fischler and Elschlager 1973; Widrow 1973; Burr 1980; Bajcsy and Kovacic 1989; Yuille 1990). For applications to image restoration and pattern synthesis, as well as to recognition, see Grenander et al. (1990), Knoerr (1988), and Amit et al. (1991). Given a pair of 16 x 16 binary images, x and x', let us denote by m(x, x') the "deformation metric" suggested by the von der Malsburg-Bienenstock theory. A formal definition of the metric m, as well as details behind the biological motivation and a more extensive discussion of
experiments, can be found in Bienenstock and Doursat (1989, 1991) and in Buhmann et al. (1989). For our purposes, the important point is that m(x, x') measures the degree of deformation necessary to best match the patterns represented by x and x'. Recall that the Parzen-window and nearest-neighbor methods require, for their full specification, some metric on the range of the input, x, and recall that we used the Hamming distance in our earlier experiments (see Section 3.5.5). Here, we experiment with k-nearest-neighbor estimation using the graph-matching metric, m, in place of Hamming distance. Of course by so doing we introduce a particular bias: small changes in scale, for example, are given less importance than when using Hamming distance, but this would seem to be highly appropriate for the task at hand. Figure 17 summarizes the results of these experiments. The task was the same as in Section 3.5.5, except that this time we added no noise to the discretized images of handwritten numerals. Examples of the uncorrupted numerals used for these experiments are shown in the top four rows of Figure 12. As in our previous experiments, the y-axis indicates mean-squared error (see Section 3.5.5), which can be used to approximate the percent misclassification by the rough rule of thumb: percent misclassification = 20 x mean-squared error. There are three curves in Figure 17. Two of these give results from experiments with the k-nearest-neighbor estimator: one using the graph-matching metric and the other, for comparison, using the Hamming distance. Also for comparison, a third curve gives results from experiments with the backpropagation network described in Section 3.5.3. As we have noted earlier, the neural net performance does not necessarily improve with each learning iteration. To make the most of the feedforward neural network, we have again used, for each number of hidden units, the optimal number of iterations (see Section 3.5.5). Note that the x-axis now serves two purposes: it indicates the value of k for the two k-nearest-neighbor curves, but also the number of hidden units for the neural network estimator; there is no correspondence between the two scales, the only purpose of this simultaneous display being the comparison of the performances of the three methods. We first observe that the best performances of the two nonparametric estimators (k-nearest-neighbor with Hamming distance and backpropagation network) are almost identical. This comes as no surprise, since we observed similar results in our experiments with the noisy data in Section 3. The result of interest here is that when the image space is equipped with the graph-matching distance m, rather than the usual Hamming distance, the performance of the k-nearest-neighbor classifier improves significantly. More experiments, including other nonparametric schemes (Parzen windows as well as various neural networks) applied either to the raw image or to an image preprocessed by extraction of local features, confirm that the use of the graph-matching distance yields significantly better results on this task (Bienenstock and Doursat 1991).
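The change of metric requires no change in the estimator itself: a k-nearest-neighbor classifier can take the distance function as a parameter, so that Hamming distance and a deformation metric m(x, x') are interchangeable. The sketch below is our own (labels are assumed to be a NumPy integer array), and the graph-matching metric itself, a substantial computation, is not implemented here.

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary 16 x 16 arrays."""
    return int(np.sum(a != b))

def knn_classify(x, train_images, train_labels, k, distance=hamming):
    """k-nearest-neighbor vote under an arbitrary distance function.
    Passing a deformation metric m(x, x') in place of `hamming` changes
    the notion of "neighbor," and hence the bias built into the machine."""
    d = np.array([distance(x, xi) for xi in train_images])
    nearest = np.argsort(d)[:k]
    votes = np.bincount(train_labels[nearest], minlength=10)
    return int(np.argmax(votes))
```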
[Figure 17 appears here: mean-squared error vs. k and number of hidden units (x-axis: k, H.U.); curves marked "Hamming," "elastic matching," and "backpropagation."]
Figure 17: Classification of handwritten numerals: performance as a function of the number k of neighbors in a k-nearest-neighbor estimator (curves marked "Hamming" and "elastic matching") and the number of hidden units in a feedforward neural network trained by error backpropagation (curve marked "backpropagation"). The two curves representing k-nearest-neighbor estimation are the results of experiments with two different measures of distance, and hence two notions of "neighbor." The first is ordinary Hamming distance (the patterns are binary); the second is a measure of the deformation necessary to bring one pattern into another (the "elastic matching" metric).
Evidently then, the metric arising from graph matching is more suitable for the problem at hand than the straightforward Hamming distance, arising from the pixel-array representation. By adopting a different representation, we have introduced a very significant bias, thereby achieving a better control of the variance. We believe, more generally, that adopting
an appropriate data representation is an efficient means for designing the bias required for solving many hard problems in machine perception. This view is of course shared by many authors. As Anderson and Rosenfeld (1988) put it: “A good representation does most of the work.”
7 Summary

To mimic substantial human behavior, such as generic object recognition in real scenes - with confounding variations in orientation, lighting, texturing, figure-to-ground separation, and so on - will require complex machinery. Inferring this complexity from examples, that is, learning it, although theoretically achievable, is, for all practical purposes, not feasible: too many examples would be needed. Important properties must be built in or "hard-wired," perhaps to be tuned later by experience, but not learned in any statistically meaningful way.

These conclusions and criticisms are probably shared by many authors. They can perhaps be argued most convincingly from the point of view of modern statistical theory, especially the theory of nonparametric inference. We have therefore presented a tutorial on the connection between nonparametric inference and neural modeling as it stands today, and we have used the statistical framework, together with some simple numerical experiments, to argue for the limitations of learning in neural modeling.

Of course most neural modelers do not take tabula rasa architectures as serious models of the nervous system; these are viewed as providing a mere starting point for the study of adaptation and self-organization. Such an approach is probably meant as a convenience, a way of concentrating on the essential issue of finding neurally plausible and effective learning algorithms. It strikes us, however, that identifying the right "preconditions" is the substantial problem in neural modeling. More specifically, it is our opinion that categorization must be largely built in, for example, by the use of generic mechanisms for representation, and that identifying these mechanisms is at the same time more difficult and more fundamental than understanding learning per se.
Acknowledgments

We are indebted to Isabelle Guyon and the AT&T Bell Laboratories for providing the data set of handwritten numerals used in our experiments, and to Nathan Intrator for a careful reading of the manuscript and many useful comments. S. G. was supported by Army Research Office Contract DAAL03-86-K-0171 to the Center for Intelligent Control Systems, National Science Foundation Grant DMS-8813699, Office of Naval Research
Contract N00014-88-K-0289, and the General Motors Research Laboratories. E. B. was supported by grants from the Commission of European Communities (B.R.A.I.N. ST2J-0416) and the French Ministère de la Recherche (87C0187).
References

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. 1985. A learning algorithm for Boltzmann machines. Cog. Sci. 9, 147-169.

Akaike, H. 1973. Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, eds., pp. 267-281. Akademiai Kiado, Budapest.

Amari, S. I. 1967. A theory of adaptive pattern classifiers. IEEE Transact. Elect. Computers EC-16, 299-307.

Amari, S. I. 1990. Dualistic geometry of the manifold of higher-order neurons. Tech. Rep. METR 90-17, Department of Mathematical Engineering and Instrumentation Physics, University of Tokyo, Bunkyo-ku, Tokyo.

Amari, S. I., Kurata, K., and Nagaoka, H. 1990. Differential geometry of Boltzmann machines. Tech. Rep., Department of Mathematical Engineering and Instrumentation Physics, University of Tokyo, Bunkyo-ku, Tokyo.

Amit, Y., Grenander, U., and Piccioni, M. 1991. Structural image restoration through deformable templates. J. Am. Statist. Assoc. 86, 376-387.

Anderson, J. A., and Rosenfeld, E. 1988. Neurocomputing: Foundations of Research, p. 587. MIT Press, Cambridge, MA.

Anderson, J. A., Silverstein, J. W., Ritz, S. A., and Jones, R. S. 1977. Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychol. Rev. 84, 413-451.

Anderson, J. A., Rossen, M. L., Viscuso, S. R., and Sereno, M. E. 1990. Experiments with representation in neural networks: Object motion, speech, and arithmetic. In Synergetics of Cognition, H. Haken and M. Stadler, eds. Springer-Verlag, Berlin.

Azencott, R. 1990. Synchronous Boltzmann machines and Gibbs fields: Learning algorithms. In Neurocomputing: Algorithms, Architectures and Applications, F. Fogelman-Soulie and J. Herault, eds., pp. 51-63. NATO ASI Series, Vol. F68. Springer-Verlag, Berlin.

Bajcsy, R., and Kovacic, S. 1989. Multiresolution elastic matching. Comput. Vision Graphics Image Process. 46, 1-21.

Baldi, P., and Hornik, K. 1989. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks 2, 53-58.

Barron, A. R. 1989. Statistical properties of artificial neural networks. Proc. 28th Conf. Decision Control, Tampa, FL, 280-285.
Barron, A. R. 1991. Complexity regularization with application to artificial neural networks. In Nonparametric Functional Estimation and Related Topics, G. Roussas, ed., pp. 561-576. Kluwer, Dordrecht.

Barron, A. R., and Barron, R. L. 1988. Statistical learning networks: A unifying view. In Computing Science and Statistics: Proceedings of the 20th Symposium on the Interface, E. Wegman, ed., pp. 192-203. American Statistical Association, Washington, DC.

Baum, E. B. 1990a. The perceptron algorithm is fast for nonmalicious distributions. Neural Comp. 2, 248-260.

Baum, E. B. 1990b. When are k-nearest-neighbor and backpropagation accurate for feasible-sized sets of examples? In Proceedings Eurasip Workshop on Neural Networks, L. B. Almeida and C. J. Wellekens, eds., pp. 2-25. Springer-Verlag, Berlin.

Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.

Baum, L. E. 1972. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3, 1-8.

Bellman, R. E. 1961. Adaptive Control Processes. Princeton University Press, Princeton, NJ.

Bienenstock, E., and Doursat, R. 1989. Elastic matching and pattern recognition in neural networks. In Neural Networks: From Models to Applications, L. Personnaz and G. Dreyfus, eds., pp. 472-482. IDSET, Paris.

Bienenstock, E., and Doursat, R. 1991. Issues of representation in neural networks. In Representations of Vision: Trends and Tacit Assumptions in Vision Research, A. Gorea, ed., pp. 47-67. Cambridge University Press, Cambridge.

Bourlard, H., and Kamp, Y. 1988. Auto-association by multi-layer perceptrons and singular value decomposition. Biol. Cybernet. 59, 291-294.

Bourlard, H., and Wellekens, C. J. 1990. Links between Markov models and multilayer perceptrons. IEEE Transact. Pattern Anal. Machine Intell. 12, 1167-1178.

Breiman, L., and Friedman, J. H. 1985. Estimating optimal transformations for multiple regression and correlation. J. Am. Statist. Assoc. 80, 580-619.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.

Buhmann, J., Lange, J., von der Malsburg, Ch., Vorbrüggen, J. C., and Würtz, R. P. 1989. Object recognition in the dynamic link architecture: Parallel implementation on a transputer network. In Neural Networks: A Dynamic Systems Approach to Machine Intelligence, B. Kosko, ed. Prentice Hall, New York.

Burr, D. J. 1980. Elastic matching of line drawings. IEEE Transact. Pattern Anal. Machine Intell. PAMI-3(6), 708-713.

Carnevali, P., and Patarnello, S. 1987. Exhaustive thermodynamic analysis of Boolean learning networks. Europhys. Lett. 4, 1199-1204.

Chauvin, Y. 1990. Dynamic behavior of constrained back-propagation networks. In Neural Information Processing Systems II, D. S. Touretzky, ed., pp. 642-649.
Morgan Kaufmann, San Mateo, CA.

Collins, E., Ghosh, S., and Scofield, C. 1989. An application of a multiple neural network learning system to emulation of mortgage underwriting judgements. Nestor Inc., Providence, RI.

Cox, D. R. 1970. The Analysis of Binary Data. Methuen, London.

Crick, F. 1989. The recent excitement about neural networks. Nature (London) 337, 129-132.

Cybenko, G. 1989. Approximations by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303-314.

Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1-38.

Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., and Hopfield, J. 1987. Automatic learning, rule extraction and generalization. Complex Syst. 1, 877-922.

Diaconis, P., and Freedman, D. 1986. On the consistency of Bayes estimates. Ann. Statist. 14, 1-26.

Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley, New York.

Dudley, R. M. 1987. Universal Donsker classes and metric entropy. Ann. Prob. 15, 1306-1326.

Faraway, J. J., and Jhun, M. 1990. Bootstrap choice of bandwidth for density estimation. J. Am. Statist. Assoc. 85, 1119-1122.

Fischler, M., and Elschlager, R. 1973. The representation and matching of pictorial structures. IEEE Transact. Comput. 22, 67-92.

Fodor, J. A., and Pylyshyn, Z. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition 28, 3-71.

Friedman, J. H. 1991. Multivariate adaptive regression splines. Ann. Statist. 19, 1-141.

Friedman, J. H., and Stuetzle, W. 1981. Projection pursuit regression. J. Am. Statist. Assoc. 76, 817-823.

Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.

Gallinari, P., Thiria, S., and Fogelman, F. 1988. Multilayer perceptrons and data analysis. Proc. IEEE ICNN 88(1), 391-399.

Geman, S., and Hwang, C. 1982. Nonparametric maximum likelihood estimation by the method of sieves. Ann. Statist. 10, 401-414.

Goldman, L., Weinberg, M., Weisberg, M., Olshen, R., Cook, E. F., Sargent, R. K., Lamas, G. A., Dennis, C., Wilson, C., Deckelbaum, L., Fineberg, H., and Stiratelli, R. 1982. A computer-derived protocol to aid in the diagnosis of emergency room patients with acute chest pain. New Engl. J. Med. 307, 588-596.

Grenander, U. 1951. On empirical spectral analysis of stochastic processes. Ark. Matemat. 1, 503-531.

Grenander, U. 1981. Abstract Inference. Wiley, New York.

Grenander, U., Chow, Y. S., and Keenan, D. 1990. HANDS: A Pattern Theoretic
Study of Biological Shapes. Springer-Verlag, New York.

Guyon, I. 1988. Reseaux de neurones pour la reconnaissance des formes: Architectures et apprentissage. Unpublished doctoral dissertation, University of Paris VI, December.

Hansen, L. K., and Salamon, P. 1990. Neural network ensembles. IEEE Transact. Pattern Anal. Machine Intell. PAMI-12(10), 993-1001.

Härdle, W. 1990. Smoothing Techniques with Implementation in S. Springer Series in Statistics. Springer-Verlag, New York.

Härdle, W., Hall, P., and Marron, J. S. 1988. How far are automatically chosen regression smoothing parameters from their optimum? J. Am. Statist. Assoc. 83, 86-95.

Hartman, E. J., Keeler, J. D., and Kowalski, J. M. 1990. Layered neural networks with Gaussian hidden units as universal approximations. Neural Comp. 2, 210-215.

Haussler, D. 1989a. Generalizing the PAC model: Sample size bounds from metric dimension-based uniform convergence results. Proc. 30th Ann. Symp. Foundations Comput. Sci., IEEE.

Haussler, D. 1989b. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Preprint.

Hinton, G. E., and Nowlan, S. J. 1990. The bootstrap Widrow-Hoff rule as a cluster-formation algorithm. Neural Comp. 2, 355-362.

Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations, D. E. Rumelhart, J. L. McClelland, and the PDP Group, eds., pp. 282-317. MIT Press, Cambridge, MA.

Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.

Huber, P. J. 1985. Projection pursuit. Ann. Statist. 13, 435-475.

Intrator, N. 1990. Feature extraction using an exploratory projection pursuit neural network. Ph.D. Thesis, Division of Applied Mathematics, Brown University, Providence, RI.

Karnin, E. D. 1990. A simple procedure for pruning back-propagation trained neural networks. IEEE Transact. Neural Networks 1(2), 239-242.

Knoerr, A. 1988. Global models of natural boundaries: Theory and applications. Ph.D. Thesis, Division of Applied Mathematics, Brown University, Providence, RI.

Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comp. 1, 541-551.

Lippmann, R. P. 1987. An introduction to computing with neural nets. IEEE ASSP Mag., April, 4-22.

Lippmann, R. P. 1989. Review of neural networks for speech recognition. Neural Comp. 1, 1-39.

Lorenzen, T. J. 1988. Setting realistic process specification limits - A case study. General Motors Research Labs. Publication GMR-6389, Warren, MI.
Maechler, M., Martin, D., Schimert, J., Csoppenszky, M., and Hwang, J. N. 1990. Projection pursuit learning networks for regression. Preprint.

Marron, J. S. 1988. Automatic smoothing parameter selection: A survey. Empirical Econ. 13, 187-208.

Martin, G. L., and Pittman, J. A. 1991. Recognizing hand-printed letters and digits using backpropagation learning. Neural Comp. 3, 258-267.

Morgan, N., and Bourlard, H. 1990. Generalization and parameter estimation in feedforward nets: Some experiments. In Neural Information Processing Systems II, D. S. Touretzky, ed., pp. 630-637. Morgan Kaufmann, San Mateo, CA.

Mozer, M. C., and Smolensky, P. 1989. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems I, D. S. Touretzky, ed., pp. 107-115. Morgan Kaufmann, San Mateo, CA.

O'Sullivan, F., and Wahba, G. 1985. A cross validated Bayesian retrieval algorithm for non-linear remote sensing experiments. J. Comput. Phys. 59, 441-455.

Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978-982.

Pollard, D. 1984. Convergence of Stochastic Processes. Springer-Verlag, Berlin.

Rissanen, J. 1986. Stochastic complexity and modeling. Ann. Statist. 14(3), 1080-1100.

Rosenblatt, F. 1962. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986a. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations, D. E. Rumelhart, J. L. McClelland, and the PDP Group, eds., pp. 318-362. MIT Press, Cambridge, MA.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986b. Learning representations by backpropagating errors. Nature (London) 323, 533-536.

Schuster, E. F., and Gregory, G. G. 1981. On the nonconsistency of maximum likelihood nonparametric density estimators. In Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, W. F. Eddy, ed., pp. 295-298. Springer-Verlag, New York.

Schwartz, D. B., Samalam, V. K., Solla, S. A., and Denker, J. S. 1990. Exhaustive learning. Neural Comp. 2, 374-385.

Scott, D., and Terrell, G. 1987. Biased and unbiased cross-validation in density estimation. J. Am. Statist. Assoc. 82, 1131-1146.

Shepard, R. N. 1989. Internal representation of universal regularities: A challenge for connectionism. In Neural Connections, Mental Computation, L. Nadel, L. A. Cooper, P. Culicover, and R. M. Harnish, eds., pp. 104-134. Bradford Books/MIT Press, Cambridge, MA.

Smolensky, P. 1988. On the proper treatment of connectionism. Behav. Brain Sci. 11, 1-74.

Solla, S. A. 1989. Learning and generalization in layered neural networks: The
contiguity problem. In Neural Networks: From Models to Applications, L. Personnaz and G. Dreyfus, eds., pp. 168-177. IDSET, Paris.

Stone, M. 1974. Cross-validatory choice and assessment of statistical predictors (with discussion). J. R. Statist. Soc. B 36, 111-147.

Tishby, N., Levin, E., and Solla, S. A. 1989. Consistent inferences of probabilities in layered networks: Predictions and generalization. In IJCNN International Joint Conference on Neural Networks, Vol. II, pp. 403-409. IEEE, New York.

Vapnik, V. N. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York.

Vardi, Y., Shepp, L. A., and Kaufman, L. 1985. A statistical model for positron emission tomography (with comments). J. Am. Statist. Assoc. 80, 8-37.

Veklerov, E., and Llacer, J. 1987. Stopping rule for the MLE algorithm based on statistical hypothesis testing. IEEE Transact. Med. Imaging 6, 313-319.

von der Malsburg, Ch. 1981. The correlation theory of brain function. Internal Report 81-2, Max Planck Institute for Biophysical Chemistry, Dept. of Neurobiology, Gottingen, W.-Germany.

von der Malsburg, Ch. 1986. Am I thinking assemblies? In Brain Theory, G. Palm and A. Aertsen, eds., pp. 161-176. Springer-Verlag, Heidelberg.

von der Malsburg, Ch., and Bienenstock, E. 1986. Statistical coding and short-term synaptic plasticity: A scheme for knowledge representation in the brain. In Disordered Systems and Biological Organization, E. Bienenstock, F. Fogelman, and G. Weisbuch, eds., pp. 247-272. Springer-Verlag, Berlin.

Wahba, G. 1979. Convergence rates of "thin plate" smoothing splines when the data are noisy. In Smoothing Techniques for Curve Estimation, T. Gasser and M. Rosenblatt, eds., pp. 233-245. Springer-Verlag, Heidelberg.

Wahba, G. 1982. Constrained regularization for ill posed linear operator equations, with applications in meteorology and medicine. In Statistical Decision Theory and Related Topics III, Vol. 2, S. S. Gupta and J. O. Berger, eds., pp. 383-418. Academic Press, New York.

Wahba, G. 1984. Cross validated spline methods for the estimation of multivariate functions from data on functionals. In Statistics: An Appraisal, Proceedings 50th Anniversary Conference Iowa State Statistical Laboratory, H. A. David and H. T. David, eds., pp. 205-235. Iowa State University Press, Ames.

Wahba, G. 1985. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Ann. Statist. 13, 1378-1402.

Wahba, G., and Wold, S. 1975. A completely automatic French curve: Fitting spline functions by cross validation. Commun. Statist. 4, 1-17.

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1988. Phoneme recognition using time-delay networks. Proc. ICASSP-88, New York.

White, H. 1988. Multilayer feedforward networks can learn arbitrary mappings: Connectionist nonparametric regression with automatic and semi-automatic determination of network complexity. UCSD Department of Economics Discussion Paper.

White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Comp. 1, 425-464.
White, H. 1990. Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks 3, 535-549.

Widrow, B. 1973. The rubber mask technique, Part I. Pattern Recognition 5, 175-211.

Yuille, A. 1990. Generalized deformable models, statistical physics, and matching problems. Neural Comp. 2, 1-24.
Received 4 May 1991; accepted 3 June 1991.
Article
Communicated by Christof Koch
A Model for the Action of NMDA Conductances in the Visual Cortex

Kevin Fox
Section of Neurobiology and Center for Neural Science, Brown University, Providence, RI 02912 USA
Nigel Daw
Department of Anatomy and Neurobiology, Washington University School of Medicine, St. Louis, MO 63112 USA
A model is presented for the behavior of N-methyl-D-aspartate (NMDA) receptors during neuronal responses to visual stimuli in visual cortex based on classical steady-state equations for membrane conductance, and for binding of transmitter with NMDA and non-NMDA receptors. Constraints in the equations and the equation for voltage dependency of the NMDA receptors come from measurements in hippocampus, embryonic CNS neurons, and cortex. An excitatory amino acid transmitter released from terminals in the visual cortex is assumed to be related to the contrast of the stimulus by a hyperbolic tangent formula. The transmitter is assumed to be glutamate. The model is a two-compartment representation of a single neuron at steady state. The model fits results obtained from contrast-response curves measured in the visual cortex in control conditions and with iontophoresis of NMDA and non-NMDA agonists or antagonists. The model shows how NMDA receptors contribute to the visual response in a graded multiplicative fashion at all levels of contrast. NMDA receptors do not show switch behavior, that is, they are not turned on at high levels of stimulation only. The difference in NMDA receptor current between low and high levels of stimulation cannot be accounted for solely by the voltage dependency of the NMDA receptor. One needs, in addition, another factor, such as a difference in the Hill coefficient for binding of glutamate at the NMDA receptor.
1 Introduction

Sensory information in the cortex appears to be transmitted by excitatory amino acids (EAAs) (Tsumoto 1990), the receptors for which can be divided into two general classes, N-methyl-D-aspartate (NMDA) and non-NMDA. NMDA receptors have two unusual properties for ligand-gated channels: they are voltage sensitive and they are permeable to
calcium (Mayer and Westbrook 1987; Stone and Burton 1988; McDonald and Johnston 1990). A number of papers have dealt with the role NMDA receptors may play in long-term potentiation in the hippocampus, concentrating on the calcium conductance of the NMDA channel and the effects of calcium on intracellular reactions (Zador et al. 1989; Holmes and Levy 1990). Of equal importance, however, is the voltage sensitivity of the channel and its postsynaptic influence on sensory integration within cortical cells: the latter is the subject of the present paper.

Classic theories on synaptic mechanisms involve linear summation of EPSPs (Granit et al. 1966; Eccles 1964). However, a number of functions performed by the CNS are distinctly nonlinear in nature: in the visual cortex, for example, direction selectivity and disparity sensitivity. In the hippocampus the induction of long-term potentiation is a well-studied nonlinear system. Nonlinearities in cortical processing have tended to be explained either as arising from nonlinear inhibitory processes or the nonlinearities of a firing threshold. For example, shunting inhibition has been proposed as an explanation for direction selectivity (Torre and Poggio 1978) and firing threshold as an explanation for the nonlinearities of disparity sensitivity (Ohzawa and Freeman 1986). Recently, NMDA receptors have been found to behave in a nonlinear fashion in the cortex over the normal input range of the neuron (Fox et al. 1990), raising a third possibility, that nonlinear cortical processing could occur at excitatory synapses themselves. The nonlinearity involved could arise from the voltage sensitivity of the NMDA channel.

Our evidence comes from iontophoretic experiments on contrast-response curves in vivo. Non-NMDA agonists behaved in a conventional linear fashion and added a constant increase in firing rate to all visual responses independent of the response magnitude. NMDA agonists behaved in a nonlinear fashion, increasing the firing rate in proportion to the control response magnitude, that is, increasing the greater control responses more than the smaller.

There are two general classes of model that might explain the behavior of the cortical cells: a neuronal circuit model that takes into account interactions between neurons, and a membrane model in which the membrane behavior of an individual cell provides the basis for the effect. We have investigated the membrane model as an explanation and present here a mathematical model, which describes the electrical behavior of the postsynaptic membrane under the circumstances of our experiment. We simulate the effect of visual stimuli of varying intensity on membrane voltage during administration of NMDA and non-NMDA agonists and antagonists. Given a number of reasonable assumptions described later, we find that the dendritic voltage in the model behaves similarly to the observed firing rate of cells recorded in layers II and III of the cortex, and may therefore underlie their behavior. Unexpectedly, the nonlinearity in the NMDA effect could not be attributed solely to the voltage dependence of the NMDA channel, but required additional factors such
as a difference in the binding coefficient for glutamate at non-NMDA and NMDA receptors. An abstract of this work has appeared (Daw and Fox 1990).

2 Summary of Results in Visual Cortex
The data to be modeled were obtained from contrast-response curves measured from cells in the visual cortex of the cat (Fox et al. 1990). To obtain a contrast-response curve we first determined the optimal stimulus for the cell, in terms of the best length, the best width, the best orientation, the best velocity, and the best direction of movement. Stimuli that are not optimal introduce inhibitory influences (Sillito 1987) that could vary with contrast and complicate the interpretation of the responses. The optimal stimulus was swept across the receptive field of the cell, and a poststimulus time histogram of the response was obtained. The contrast was varied between 0.1 and 10 times background, and the magnitude of the response was measured for six contrasts in this range. The contrast-response curve was then remeasured in the presence of NMDA or non-NMDA agonists or antagonists.

In general, none of these drugs changed the level of contrast required for a threshold response, or the level of contrast required for saturation of the response. NMDA increased the slope of the curve while the NMDA antagonist APV reduced the slope of the curve (Fig. 1A). The non-NMDA agonist quisqualate increased spontaneous activity so that the response, after subtracting spontaneous activity, was unchanged (Fig. 1B). The antagonist CNQX either reduced the slope of the curve or abolished the response altogether (Fig. 1C). These results were obtained in layers II and III of adult visual cortex but are also applicable to lower layers of kitten visual cortex, before the NMDA receptors in these layers lose their drive from visual synapses (Fox et al. 1989). In lower layers of adult cortex the visual synapses that are within reach of iontophoresis do not activate NMDA receptors directly, so that APV does not affect the visual response. However, exogenous NMDA increases the gain of the contrast-response curve, implying an indirect modulation (Fox et al. 1990).

3 Description of the Model
The model is devised for a stimulus sweeping slowly through the receptive field of a cell in area 17. In these circumstances the response is prolonged and a steady state of firing of action potentials is achieved, lasting 1-2 sec. Consequently we have ignored the activation kinetics of receptors and voltage-gated channels that have a time course of the order of milliseconds. The response seen in the soma consists of a relatively steady level of depolarization, with a constant level of firing superimposed on it (Fig. 2).
[Figure 1 plots appear here: panels A, B, C; x-axes: Log Contrast.]
Figure 1: (A) Effect of iontophoresis of NMDA and APV on the contrast-response curve of a cell in layer III. (B) Averaged effect of quisqualate and NMDA on 4 layer II/III cells. (C) Effect of APV and CNQX on a layer II/III cell (top) and a layer V cell (bottom). (Adapted from Fox et al. 1990.)

3.1 Membrane Properties. We presume that excitatory amino acid receptors are located on dendritic spines, but to simplify the model we locate non-NMDA and NMDA channels on the dendritic shaft. The equations for the membrane are the traditional voltage and conductance equations for membranes reduced to steady state:
iM + iN + iQ = 0                                   (3.1)
iN = gN(V - EN)                                    (3.2)
iQ = gQ(V - EQ)                                    (3.3)
iM = gM(V - EM)                                    (3.4)

In these equations iN, gN, and EN are current, conductance, and reversal potential for the NMDA receptor; iQ, gQ, and EQ are current, conductance, and reversal potential for the non-NMDA receptor; EM is the resting potential; V is the membrane voltage in the dendrite; and gM is the input conductance of the cell as seen from the synapse.

[Figure 2 trace appears here.]

Figure 2: Intracellular recording from a layer II/III neuron recorded in area 17 during stimulation with a bar of light swept across the receptive field. Horizontal bar = 1 second. Vertical divisions = 20 mV. "Resting potential" approximately -50 mV. (Courtesy of Hiromichi Sato.)

Presumably many synapses are activated by an optimal stimulus and, at steady state, some of the current from each will go to depolarize the various spines and dendritic branches of the neuron and some current will converge to depolarize the soma. In this model, we simplify the morphology of the cell such that the total synaptic current, iM, is represented by a local dendritic current iD and a remote somatic current iS:

iM = iD + iS                                       (3.5)
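Although the paper does not display it, equations 3.1-3.4 can be solved directly for the dendritic voltage; the following one-line reconstruction (ours, not quoted from the paper) makes the steady state explicit:

\[
V \;=\; \frac{g_N E_N + g_Q E_Q + g_M E_M}{g_N + g_Q + g_M}
\;=\; \frac{g_M E_M}{g_N + g_Q + g_M} \quad (E_N = E_Q = 0).
\]

Since gN itself depends on V (equation 3.11 below), this is a fixed-point condition to be solved numerically rather than a closed form.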
We assume, as a first approximation, that the membrane behaves linearly apart from the synaptic conductances. The resting input conductance of
the membrane seen from the synapse is separated into a local dendritic (gD), an axial dendritic (ga), and a somatic (gS) component (see Fig. 3):

gM = gD + gagS/(ga + gS)                           (3.6)

The rate at which the cell fires action potentials is related to somatic current (Connors et al. 1988) and therefore to iS in our model (see Fig. 3).
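For the two-compartment circuit of Figure 3, the somatic share of the synaptic current follows from current balance at the soma node; this short reconstruction (ours, consistent with equation 3.6) makes the firing-rate relation explicit, writing Vms for the somatic voltage:

\[
V_{ms} = \frac{g_a V + g_S E_M}{g_a + g_S}, \qquad
i_S = g_a\,(V - V_{ms}) = \frac{g_a g_S}{g_a + g_S}\,(V - E_M),
\]

so a firing rate proportional to iS is, at steady state, proportional to the dendritic depolarization V - EM.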
3.2 Receptor Characteristics. The equation for the conductance through the non-NMDA channel involves a Hill coefficient of 1 for the binding of ligands to the non-NMDA receptor. A Hill coefficient of 1 has been measured in hippocampal slices (Yamada et al. 1989) and applies to the peak current measured before desensitization (Patneau and Mayer 1990). Expressing the concentrations of the ligands relative to their Kd's (Colquhoun 1973), the conductance of the non-NMDA channel is

gQ = kQ (Gq + Cq)/(1 + Gq + Cq)                    (3.7)

where

Gq = (glutamate concentration at quisqualate receptor)/(Kd for glutamate at the quisqualate receptor)   (3.8)

and similarly Cq is the ratio of the quisqualate concentration at the quisqualate receptor to the Kd for quisqualate at the quisqualate receptor. To allow for the effect of the antagonist CNQX, the equation becomes

gQ = kQ (Gq + Cq)/(Aq + Gq + Cq)                   (3.9)

where kQ is the maximum conductance and

Aq = 1 + (conc. of CNQX)/(Ki for CNQX)             (3.10)
The equation for the conductance through the NMDA channel involves a Hill coefficient of 2 for the binding of ligands to the NMDA receptor. A slope of 2 has been measured for NMDA receptor activation at low concentrations of agonists in hippocampal cell cultures (Patneau and Mayer 1990) and from cortex plus hippocampus synaptic membranes (Javitt et al. 1990). A voltage-dependent term is also required that can be described by a Boltzmann equation. The exponent in the voltage-dependent term (a) and its relationship to magnesium concentration ([Mg]/k) have been measured in cultured mouse neurons and dissociated hippocampal neurons (Ascher et al. 1988; Jahr and Stevens 1990). The conductance through the NMDA channel can therefore be written as

gN = kN [(Gn + Cn)/(An + Gn + Cn)]^2 / (1 + ([Mg]/k) exp(aV))      (3.11)

where Gn, Cn, An, and kN are the NMDA receptor equivalents of Gq, Cq, Aq, and kQ for the non-NMDA receptor (equations 3.9 and 3.10). The term [Mg] represents extracellular magnesium concentration.

[Figure 3 schematic appears here.]

Figure 3: Schematic representation of the electrical model of the neuron. The model is composed of two compartments. The dendritic compartment consists of three conductances: NMDA (gN), non-NMDA (gQ), and resting dendritic conductance (gD). The soma conductance (gS) is connected to the dendrite via an axial conductance (ga). Somatic current (iS) is proportional to firing rate (f). Vmd represents the dendritic voltage (V in equations 3.1 to 3.4) and Vms the somatic voltage.

3.3 Presynaptic Relationship. The concentration of transmitter released from the afferent terminals is assumed to be related to the contrast (C) of the visual stimulus by a hyperbolic tangent formula:

G = Gmax C/(C + C50) + Gmin                        (3.12)
The constant Gmax represents the change in transmitter released between saturation and threshold levels of contrast. The constant Gmin represents the glutamate spontaneously released at threshold levels of contrast, and the constant C50 represents the contrast that gives a half-maximal change in response. The hyperbolic tangent formula has been found to represent the response of photoreceptors (Baylor and Fuortes 1970), the response of horizontal cells (Naka and Rushton 1966), and of cells in the visual cortex (Galetti et al. 1979). It has the property of commutativity: that is, if the response of photoreceptors obeys a hyperbolic tangent formula, and the synapse between photoreceptors and horizontal cells represents a hyperbolic tangent formula, then the response of horizontal cells will also obey a hyperbolic tangent formula. Consequently hyperbolic tangent relationships for each cell along the pathway will yield a hyperbolic tangent relationship for the release of transmitter from each terminal along the pathway.
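Before turning to parameter values, the model can be made concrete with a minimal numerical sketch (ours; the authors used the Tutsim simulation package, and all names below are ours). It combines equations 3.1-3.4, 3.7, 3.9, 3.11, and 3.12 and solves the current balance for the dendritic voltage by bisection, using the Table 1 values. Two assumptions are flagged in the comments: C50, which is not listed in Table 1, is set to 1, and the Boltzmann exponent a is applied per millivolt, which reproduces the roughly 20 mV control depolarization cited in Section 4.3.

    import math

    # Table 1 values; conductances are relative to gM = 1, concentrations in uM.
    KD_Q, KD_N = 360.0, 1.1      # Kd for glutamate at non-NMDA and NMDA receptors
    E_N = E_Q = 0.0              # reversal potentials (mV)
    E_M = -70.0                  # resting potential (mV)
    A_COEF, MG_K = -0.08, 0.2    # Boltzmann exponent (per mV: assumption) and [Mg]/k
    K_Q, K_N = 58.0, 11.3        # maximal non-NMDA and NMDA conductances
    G_MAX, G_MIN = 1.0, 0.1      # presynaptic constants (uM)
    C50 = 1.0                    # contrast at half-maximal release (assumption)

    def glutamate(contrast):
        # Equation 3.12: transmitter concentration as a function of contrast.
        return G_MAX * contrast / (contrast + C50) + G_MIN

    def dendritic_voltage(contrast, cq=0.0, cn=0.0, aq=1.0, an=1.0):
        # Solve iN + iQ + iM = 0 (equations 3.1-3.4) for V by bisection.
        g = glutamate(contrast)
        gq_rel, gn_rel = g / KD_Q, g / KD_N   # concentrations relative to Kd
        def net_current(v):
            g_q = K_Q * (gq_rel + cq) / (aq + gq_rel + cq)               # eq 3.9
            block = 1.0 / (1.0 + MG_K * math.exp(A_COEF * v))            # Boltzmann term
            g_n = K_N * ((gn_rel + cn) / (an + gn_rel + cn))**2 * block  # eq 3.11
            return g_n * (v - E_N) + g_q * (v - E_Q) + (v - E_M)
        lo, hi = E_M, 0.0   # the root is bracketed between rest and reversal
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if net_current(mid) < 0.0 else (lo, mid)
        return 0.5 * (lo + hi)

In this form, iontophoresis of an agonist corresponds to raising cq or cn above 0, and an antagonist to raising aq or an above 1 (equations 3.9-3.11); under the control condition, dendritic_voltage(10.0) gives roughly -50 mV, that is, about 20 mV of depolarization from rest.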
4 Parameters of the Model

The common parameters of the model are listed in Table 1. Specific values for different examples are given in the legends of the figures to which they apply. Note that values for conductance and concentration are dimensionless throughout most of the text, as they are expressed relative to resting conductance (gM) and to the Kd's for the receptors. Conductances are relative to gM, which is taken as unity for convenience with no loss of generality. Constraints on gN and gQ are described below (Section 4.3).
Table 1: Parameters of the Model

Parameters from the literature                                Value
Kd       For glutamate at the non-NMDA receptor               360 µM
Kd       For glutamate at the NMDA receptor                   1.1 µM
EN       Reversal potential at NMDA receptor                  0 mV
EQ       Reversal potential at non-NMDA receptor              0 mV
EM       Resting potential                                    -70 mV
a        Voltage dependency of NMDA receptor                  -0.08 V-1
[Mg]/k   Magnesium conc. dependency for NMDA                  0.2
Ki       For CNQX at non-NMDA receptor                        3 µM
Ki       For APV at NMDA receptor                             5 µM

Parameters constrained by contrast-response curve data        Value
kQ       Maximal non-NMDA conductance                         58
kN       Maximal NMDA conductance                             11.3
Gmax     Change in glutamate concentration                    1 µM
Gmin     Minimum glutamate concentration                      0.1 µM

4.1 Values from the Literature. Many parameters in the model were taken from measurements made in other preparations, particularly cultured hippocampal and mouse CNS cells, where the properties of NMDA receptors have been studied most extensively. Our assumption has been that the characteristics of EAA receptors are the same in all preparations. We have taken the transmitter at the receptor to be glutamate. Thus the Kd for the physiological effect of glutamate at the non-NMDA receptor was taken to be 360 µM and for the NMDA receptor 1.1 µM. These specific values came from Patneau and Mayer (1990) and are in the general range measured by Yamada et al. (1989) and Trussell et al. (1988). The reversal potential for both the NMDA and non-NMDA receptors is 0 mV (Ascher et al. 1988; Ascher and Nowak 1988a). Resting potential is approximately -70 mV (Connors et al. 1988; Flatman et al. 1986). The voltage dependency of the NMDA receptor has been shown to obey a Boltzmann equation. We took the exponent to be a = -0.08 V-1, and the factor [Mg]/k to be 0.2 (equation 3.11); the specific values are from Ascher and Nowak (1988b) and are close to those measured by Jahr and Stevens (1990).

4.2 Values Unknown, but Relative. Some parameters are unknown, for example, the concentrations of iontophoresed substances. However, they only come into the formulas as values relative to the Kd's for the agonists (16.4 µM for NMDA at the NMDA receptor (Patneau and Mayer 1990) and 67 µM for quisqualate at the non-NMDA receptor (Yamada et al. 1989)) and relative to the Ki's for the antagonists (3.5 µM for CNQX at the non-NMDA receptor (Yamada et al. 1989) and 5 µM for APV at the
NMDA receptor (Harrison and Simmonds 1985; Greenamyre et al. 1985)). Consequently we worked with relative and dimensionless numbers. The absolute concentrations are realistic insofar as micromolar concentrations are typically achieved with iontophoresis using carbon fiber electrodes (Armstrong-James and Fox 1983), but effects of concentration gradients at distances away from the iontophoretic source are not taken into account in this model.

4.3 Values Constrained by the Analysis. Some parameters have not been measured directly, but are constrained by the results of our contrast-response curve analysis. For example, kN and kQ are the maximal conductances that can be obtained in the NMDA and non-NMDA channels, relative to the resting conductance, if all channels were open and fully activated. Two observations have guided our choice of values. First, APV decreases visual responses of layer II/III cells by about half. We therefore chose kN and kQ to make gN and gQ of the same order of magnitude at the highest levels of contrast. Second, if the dendrite is depolarized by more than about 35 mV during agonist application, iontophoresis of additional NMDA leads to a shift of the curve to the left as well as an increase in the slope, and this is not what is observed (see below). Moreover, if the dendrite is depolarized by less than 20 mV during agonist application, this is not enough to activate the NMDA conductances substantially. We therefore chose kN and kQ so that, at maximum, the sum of gN and gQ depolarized the dendrite by 25-35 mV during agonist application (20 mV under control conditions).
5 Performance of the Model

Given all these constraints on the parameters, this leaves only two parameters to be varied in simulations of the data. These are the release of transmitter at spontaneous levels of activity (Gmin) and the maximum range of release produced by saturation levels of contrast (Gmax). Even in the case of these parameters, it turns out that there are constraints. We used our modeling program (Tutsim, Palo Alto, CA) to test a variety of combinations of these two parameters, and discovered the following points:
1. Gmax should be at least one order of magnitude greater than Gmin in order to get a substantial difference in the voltage-dependent NMDA conductance at high levels of contrast compared to low levels of contrast.

2. Gmax cannot be much greater than the Kd for binding of the transmitter at the NMDA receptor, otherwise iontophoresis of NMDA increases spontaneous activity rather than gain.

3. If Gmin is too low, then APV has no effect on the visual response near threshold.

The results of the model, using the parameters described in the preceding section, are shown in Figure 4. For the iontophoresis of agonists and antagonists we used 3.5 µM quisqualate (0.0052 x Kd), 4.2 µM NMDA (0.38 x Kd), 20 µM APV (4 x Ki), and 12 µM CNQX (4 x Ki).

[Figure 4 plot: dendritic membrane voltage (mV) vs. contrast of stimulus.]

Figure 4: Simulation of the effects of iontophoresis of quisqualate, NMDA, APV, and CNQX on the contrast-response curve. Parameters Gmax = 1 µM, Gmin = 0.1 µM, kQ = 58, kN = 11.3; iontophoresis quisqualate 0.0052 x Kd, NMDA 0.38 x Kd, APV 4 x Ki, CNQX 4 x Ki.

One can see that the model gives the basic form of the results obtained in the visual cortex, namely that iontophoresis of NMDA increases the slope of the curve, APV reduces the slope of the curve, quisqualate moves the whole curve upward, and CNQX reduces the slope substantially. Levels of contrast that give threshold and saturation are not changed very much by the application of any of these drugs, in agreement with the observations. The parameters in the model can be changed in various ways resulting in curves that do not fit the data as well. These can be detailed as follows.
[Figure 5 plot: dendritic membrane voltage (mV) vs. contrast of stimulus.]
Figure 5: Effect of iontophoresis of NMDA when there is substantial depolarization of the dendrite in the control condition. Parameters Gmax = 1 µM, Gmin = 0.1 µM, kQ = 75, kN = 15; iontophoresis of NMDA 0.2 x Kd and 0.4 x Kd.
5.1 Increasing the Density of Postsynaptic Receptors. If the density of postsynaptic receptors is increased so that the level of depolarization in the dendrite during agonist application is more than 35 mV in response to a high-contrast visual stimulus, then iontophoresis of additional NMDA leads to a shift of the saturation level to lower contrasts, as well as a change in the slope of the contrast-response curve (Fig. 5). This occurs because the change in membrane voltage is limited by the reversal potential of the EAA receptors, which is close to 0 mV. Consequently, in the variations listed below, kN and kQ were adjusted to give the same level of depolarization in the dendrite for the control condition, and approximately equal contributions of iN and iQ to the current at high contrasts of the stimulus.
NMDA Conductance in Visual Cortex
71
for glutamate at the NMDA receptor, then iontophoresis of NMDA and APV leads to a shift of the entire contrast-response curve upward and downward, respectively, rather than a change in the slope of the curve (Fig. 6). If Gmin is of the same order of magnitude as the Kd for glutamate at the NMDA receptor, and G,, is an order of magnitude greater, then the amount of NMDA that has to be iontophoresed to have an effect becomes very large. Kd
5.3 Changing the Exponent in the NMDA Voltage-Sensitive Term. The best fit of the model to the data was obtained with an exponent of a = -0.08 and a constant of [Mg]/k = 0.2 (Ascher and Nowak 1988b). An exponent of u = -0.065 and a constant of [Mg]/k = 0.5 (Jahr and Stevens 1990) did not change the effect of quisqualate, APV, or CNQX very much, but meant that NMDA had less effect on the slope of the contrast-response curve (Fig. 7).
5.4 Changing the Coefficient in the Binding Curves. Our model has a coefficient of 2 for the binding curve for glutamate at the NMDA receptor and a coefficient of 1 for the binding curve for glutamate at the non-NMDA receptor. There is good evidence for a coefficient of 2 for the NMDA receptor (Patneau and Mayer 1990; Javitt et al. 1990) and a coefficient of 1 applies to the response of glutamate at the non-NMDA receptor in the nondesensitized state (Patneau and Mayer 1990). If the coefficientsare made the same, either 1 for both NMDA and non-NMDA receptors, or 2 for both NMDA and non-NMDA receptors, then the fit of the model to the data is substantially worse (Fig. 8). Essentially the curves for iontophoresis of NMDA and quisqualate become parallel to each other in the lower portion of the contrast-response curve. For a coefficient of 2 for both receptors, the value of kQ is also substantially increased (with G,, and Gminkept the same). This is because two molecules of glutamate are now required to activate the non-NMDA channel, so that for relatively low levels of glutamate (G,,, is two orders of magnitude below the K d for glutamate at the non-NMDA receptor), k~ must be made very large to compensate. 6 Nonvisual Glutamate Receptors
In practice, there will be NMDA and non-NMDA receptors within the reach of the iontophoresed drugs that are not activated by the visual stimulus. To take account of these, the equations can be modified as follows: Gq + c, gQ = kQAq+ G, + C,
Gqe +
+ Cq
kQEA,+ G,,
+ C,
(6.1)
[Figure 6 plots (top and bottom): dendritic membrane voltage (mV) vs. contrast of stimulus.]
Figure 6: Increase in release of glutamate from the presynaptic terminal. Top: Parameters Gmax = 3 µM, Gmin = 0.3 µM, kQ = 19.6, kN = 4.9; iontophoresis quisqualate 0.017 x Kd, NMDA 3.8 x Kd, APV 4 x Ki, CNQX 4 x Ki. Bottom: Parameters Gmax = 10 µM, Gmin = 1 µM, kQ = 6.8, kN = 3; iontophoresis quisqualate 0.06 x Kd, NMDA 500 x Kd, APV 4 x Ki, CNQX 4 x Ki.
where Gqe is the extrasynaptic concentration of glutamate relative to the Kd for glutamate at the non-NMDA receptor, and kQE is the maximal conductance of non-NMDA receptors at sites not activated by the visual stimulus, and

gN = {kN [(Gn + Cn)/(An + Gn + Cn)]^2 + kNE [(Gne + Cn)/(An + Gne + Cn)]^2} Vterm      (6.2)

where Gne is the extrasynaptic concentration of glutamate relative to the Kd for glutamate at the NMDA receptor, kNE is the maximal conductance of NMDA receptors at sites not activated by the visual stimulus, and Vterm is the Boltzmann voltage-dependent term described in equation 3.11.

[Figure 7 plot: dendritic membrane voltage (mV) vs. contrast of stimulus.]

Figure 7: Change in exponent of NMDA receptor voltage-sensitive term. Exponent a = -0.065 V-1, constant [Mg]/k = 0.5. Parameters Gmax = 1 µM, Gmin = 0.1 µM, kQ = 60, kN = 13; iontophoresis of quisqualate 0.008 x Kd, NMDA 0.75 x Kd, APV 4 x Ki, CNQX 4 x Ki.

Addition of these terms in fact leads to some improvement in the fit of the model to the data, in that APV has a greater effect close to threshold (Fig. 9). The parameters chosen are identical to those for the simulation shown in Figure 4 except that extrasynaptic NMDA and non-NMDA receptors have been added in equal numbers to those under the visual synapses.
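In the sketch of Section 4, equations 6.1 and 6.2 amount to adding a second receptor pool to each conductance. A minimal modification (ours; gqe and gne are the extrasynaptic glutamate levels relative to the respective Kd's, and the constants are those of the earlier sketch):

    import math

    K_Q, K_N, MG_K, A_COEF = 58.0, 11.3, 0.2, -0.08  # as in the earlier sketch

    def conductances(gq, gn, gqe, gne, v, cq=0.0, cn=0.0, aq=1.0, an=1.0,
                     kqe=58.0, kne=11.3):
        # Equation 6.1: synaptic plus extrasynaptic non-NMDA pools.
        g_q = (K_Q * (gq + cq) / (aq + gq + cq)
               + kqe * (gqe + cq) / (aq + gqe + cq))
        # Equation 6.2: both NMDA pools share the Boltzmann voltage term.
        vterm = 1.0 / (1.0 + MG_K * math.exp(A_COEF * v))
        g_n = (K_N * ((gn + cn) / (an + gn + cn))**2
               + kne * ((gne + cn) / (an + gne + cn))**2) * vterm
        return g_q, g_n

Setting kqe and kne equal to the synaptic maxima corresponds to the Figure 9 condition, and the layer V/VI case of Section 7 corresponds to kN = 0, kNE = 25.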
[Figure 8 plots (top and bottom): dendritic membrane voltage (mV) vs. contrast of stimulus.]
Figure 8: Change in coefficient of binding of glutamate to NMDA and non-NMDA receptors. Top: Coefficient of binding for both receptors 1. Parameters Gmax = 1 µM, Gmin = 0.1 µM, kQ = 60, kN = 5.4; iontophoresis of quisqualate 0.0055 x Kd, NMDA 1.2 x Kd, APV 4 x Ki, CNQX 4 x Ki. Bottom: Coefficient of binding for both receptors 2. Parameters Gmax = 1 µM, Gmin = 0.1 µM, kQ = 20,000, kN = 11; iontophoresis of quisqualate 0.002 x Kd, NMDA 0.4 x Kd, APV 4 x Ki, CNQX 4 x Ki.
[Figure 9 plot: dendritic membrane voltage (mV) vs. contrast of stimulus.]
Figure 9: Effect of glutamate receptors not driven by visual stimuli. Parameters Gmax = 1 µM, Gmin = 0.1 µM, kQ = 58, kN = 11.3; iontophoresis of quisqualate 0.0023 x Kd, NMDA 0.17 x Kd, APV 4 x Ki, CNQX 4 x Ki.

7 Lower Layers of Adult Visual Cortex
The parameters that we have presented so far produce results applicable to all layers of kitten visual cortex up to the third week of age, and layers II and III of adult visual cortex. In these layers NMDA receptors made a substantial contribution to the visual response, demonstrated by the finding that the visual response is reduced by iontophoresis of APV. The situation is different in lower layers of older cats (Fox et al. 1989, 1990). In layers V and VI, iontophoresis of APV does not affect the visual response, although it may affect the spontaneous activity of the cell, while iontophoresis of NMDA increases the visual response. This situation can be modeled by decreasing kN and increasing kNE (Fig. 10). In this example the extreme case is modeled where there are no NMDA receptors under visual synapses (kN = 0). The quisqualate conductance is adjusted to produce approximately 20 mV depolarization at saturation contrasts (kQ = 100), in keeping with all the simulations. The remaining parameter,
[Figure 10 plot: dendritic membrane voltage (mV) vs. contrast of stimulus.]
Figure 10: Simulation of results in layers V and VI. Parameters Gmax = 1 µM, Gmin = 0.1 µM, kQ = 100, kN = 0, kQE = 0, kNE = 25; iontophoresis of quisqualate 0.0046 x Kd, NMDA 0.26 x Kd, APV 4 x Ki, CNQX 4 x Ki.
kNE, representing NMDA receptors not under visual synapses, is then adjusted until APV iontophoresis results in a decrease in spontaneous activity, as observed experimentally. All other parameters are the same as for the other simulations. The gain increase, resulting from NMDA iontophoresis onto the layer V/VI cells, is accompanied by a larger increase in spontaneous activity than occurs with simulations for layer II/III cells (compare NMDA traces in Figs. 9 and 10) and is therefore in keeping with experimental results (Fox et al. 1990). In layer IV, neither iontophoresis of APV nor iontophoresis of NMDA has much effect on visual responses, which can be modeled by decreasing both kN and kNE (not shown).
8 Discussion
One of the main results from our work on contrast-response curves in the visual cortex was that NMDA receptors contribute to the visual response in a graded fashion, all the way from near threshold to saturation. If APV reduces the response by 50% at high levels of contrast, then it will also reduce the response by about 50% at low levels of contrast near threshold. The model that we have presented here shows how this can occur, using the voltage-dependent properties of the NMDA receptor.

In constructing this model we were interested to find out whether it was possible to account for our results in visual cortex based purely on the membrane properties of the cell. We did not, therefore, explore models involving networks of neurons with feedback connections. We found that in order to make our model mimic the experimental results the non-NMDA receptor needed to be unsaturated even at maximum contrast of stimulation. With this result included in the model, the additive effect of quisqualate was very robust to changes in other parameters. If the non-NMDA receptor was saturated at high-contrast inputs, however, the NMDA receptor became saturated at low contrasts and NMDA had no additional effect on the gain, which was not the experimentally observed effect.

We have considered only the steady state in our model, arguing that this is reasonable given that intracellular recordings from cells during long-duration responses to stimulation with moving bars of light show prolonged steady depolarizations. However, in doing so we have not taken into account the temporal properties of the NMDA and non-NMDA receptors, which differ and may therefore contribute to the functional difference in behavior of the two receptor types in the visual cortex under conditions of natural stimulation (Fox et al. 1990). The effect on the cell's behavior of the differing temporal properties of the channels associated with the NMDA and non-NMDA receptors probably warrants further exploration.

There are other assumptions that have been made in our formulation. One is that there is a linear relationship between depolarizing current and firing rate. This is a good assumption for current injected into the cell body (Connors et al. 1988). However, there is some loss of current between the dendritic spine and the axon hillock, which leads to nonlinearities in the equations (Brown et al. 1988): though for relatively small synaptic conductance changes, most of the synaptic current is transferred unattenuated from the spine head to the dendrite (Wilson 1984). In favor of the latter argument, small synaptic conductances have been measured in cultured hippocampal neurons (Bekkers and Stevens 1989). A second assumption is that there are no other voltage-sensitive conductances, such as Ca2+ conductances in the spine head or between the spine and the soma. If one does take these into account, the equations become sufficiently complicated that simple predictions can no longer be made.
Excitatory amino acid receptors probably reside on spine heads. It has been suggested that voltages in the spine could approach 0 mV, even for unitary synaptic events (Koch and Poggio 1983). Our model performs best if the voltage affecting the NMDA receptors does not exceed approximately -35 mV at saturation contrasts. These two predictions are not contradictory if the main voltage affecting the NMDA receptors (Vmd) is the steady-state voltage of the dendritic shaft, which would not follow the spine head voltage transients due to the attenuating effect of the spine neck resistance and the dendritic capacitance. There are two reasons why the dendritic voltage is likely to be dominant: first, transient voltage swings towards 0 mV at the spine head would have relatively little effect on the NMDA channels because of their slow activation kinetics (Lester et al. 1990) compared to the effect of a slow depolarization (a similar argument has been made by Gustafsson and Wigstrom (1988)). Second, a slow depolarization could arise in the dendritic shaft from the time integration of many spinous currents, and the resultant voltage would appear unattenuated at the spine head (Shepherd and Brayton 1979) where it could modulate the NMDA channels.

In this model the release of transmitter at the synapse varies in concentration with contrast. This is a reasonable assumption for transmitter release in the retina, where release is controlled by membrane voltage, but is not definitely known to occur in the cortex, where release is controlled by action potentials. There is some evidence that receptors are saturated at individual synapses in hippocampus with the least quantal release of transmitter (Larkman et al. 1991). However, this is not known to be the case in visual cortex. Differences may exist between release caused by brief intermittent electrical stimuli and maintained presynaptic activity. For example, maintained discharge may exhaust transiently high levels of transmitter release. Differences may also exist between in vivo and in vitro conditions; for example, glutamate uptake mechanisms probably work at greater rates in vivo. However, if cortical synaptic receptors are always saturated, then the extra nonlinearity required to explain our results must derive from a source other than that provided by the Hill coefficient of 2 for the NMDA receptor.

The model is based on parameters taken from a number of studies, conducted on hippocampal cells and cultured mouse embryonic neurons as well as cerebral cortex. It therefore seems likely that these findings will be generally applicable to various regions of the central nervous system where NMDA receptors are found. However, to really prove the general applicability of the model, the various parameters will have to be measured in visual cortex to show that they are the same as in other areas, and response-intensity curves will have to be measured in the hippocampus, to show that they are affected by drugs in the same smooth graded fashion as in the visual cortex. Recent evidence does suggest that APV affects responses at spontaneous levels of activity in the hippocampus (Sah et al. 1989), developing turtle cortex (Blanton et al. 1990) and cat
visual cortex (Fox et al. 1989). This point is certainly consistent with the model. LTP in the hippocampus is turned on by high levels of stimulation, but not by low levels of stimulation (Bliss and Lomo 1973), and it is known that NMDA receptors (Collingridge and Bliss 1987) and calcium entry (Malenka et al. 1988) are involved in this process. At some level in the series of reactions between stimulation of the afferents and establishment of LTP, there has to be a "switch" that turns LTP on. Our results suggest that the voltage dependency of the NMDA receptor may not be sufficient to yield a switch-like behavior, since it contributes to the response in a graded fashion at all levels of stimulation. The switch could be a nonlinear detector further down the chain of intracellular reactions, as proposed by Lisman (1985): the nonlinearity in calcium-calmodulin activation is one such candidate (Gamble and Koch 1987). In the process of changing the various parameters in our model, we discovered that one can produce a sudden change in the response at a particular level of stimulation, with a switch-like behavior. The parameters that are necessary to give this behavior are any combination of k_N, k_Q, G_min, and G_max that leads to depolarization in the dendritic spine of more than 35 mV at medium levels of contrast. An example is shown in Figure 11. This is not the way that the system behaves in the visual cortex, as shown in Figure 4. Whether this is the way that it behaves in the hippocampus remains to be seen. For it to happen there in a fashion that is responsible for LTP, the conductance would have to change to more than double the resting conductance on activation at frequencies of stimulation that produce LTP. Experiments similar to those of Martin (1989) in the visual cortex have not yet been done in the hippocampus to test this point.
9 Conclusions
The essential point in our results is that NMDA receptors contribute a substantial component of the response (at least half) at high levels of contrast, and a small but measurable component at low levels of contrast. In the past, the aspect of NMDA receptors that has received the most attention in accounting for this is the voltage dependency. In our model there is a second component that makes a contribution - the difference in the Hill coefficient for the physiological effect of glutamate at the NMDA receptor compared to the non-NMDA receptor. In fact, our calculations show that the voltage dependency of the NMDA receptor increases the NMDA component of the conductance by a factor of 3.65 between threshold and saturation, while the difference in Hill coefficient increases it by a factor of 36. We discovered that, without a difference in Hill coefficient, we could not reproduce the results from the visual cortex
Figure 11: Increase in density of glutamate receptors leads to switch behavior. The curves plot response (mV) against contrast of stimulus (0.05-10). Parameters: G_max = 1 µM, G_min = 0.1 µM, k_Q/k_N = 5.13, k_N varied to 12, 15, 18, and 21.
for layer II/III cells unless the exponent in the Boltzmann equation was increased substantially from -0.08. We find it intriguing that our simple model does fit the results from visual cortex, and it makes four predictions that can be tested by further experiment: (1) that the depolarization in the dendrite is not more positive than -35 mV at saturation contrasts during sensory stimulation, (2) that the voltage sensitivity of the NMDA receptor is necessary but not sufficient to explain how NMDA multiplies the response while quisqualate adds to it, (3) that the voltage sensitivity of the NMDA receptor does not lead it to act like a switch in the visual cortex, and (4) that the membrane of individual cortical cells should behave similarly to our model in the presence of EAA agonists and antagonists.
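To make the relative size of these two factors concrete, a minimal Python sketch is given below. It evaluates a Hill-coefficient-2 activation curve and a Boltzmann voltage dependence with the exponent -0.08/mV quoted above; the dissociation constant k_N, the Mg²⁺-block constant eta, and the threshold and saturation operating points are illustrative assumptions rather than the model's fitted values, so the printed gains match the quoted factors (3.65 and 36) only in order of magnitude.

import math

# Illustrative sketch of the two gain factors discussed in the text;
# k_N, eta, and the operating points below are assumed values.

def hill(glu, k, n):
    # Fractional receptor activation at transmitter concentration glu.
    return glu**n / (glu**n + k**n)

def boltzmann(v, slope=0.08, eta=0.3):
    # Relief of the voltage-dependent NMDA block at membrane potential v (mV).
    return 1.0 / (1.0 + eta * math.exp(-slope * v))

k_N = 1.0                        # assumed dissociation constant (units of glu)
glu_thresh, glu_sat = 0.1, 1.0   # assumed G_min, G_max (cf. Figure 11)
v_thresh, v_sat = -55.0, -35.0   # assumed near-threshold and saturation voltages

hill_gain = hill(glu_sat, k_N, 2) / hill(glu_thresh, k_N, 2)
volt_gain = boltzmann(v_sat) / boltzmann(v_thresh)
print(f"gain from Hill coefficient 2: {hill_gain:.0f}x")  # ~50x; text quotes 36
print(f"gain from voltage dependence: {volt_gain:.1f}x")  # ~4.3x; text quotes 3.65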
Acknowledgments

We express thanks to Barry Connors, Larry Cauller, and John Lisman for their helpful criticisms and suggestions. This work was supported by NIH Grants NS 27759 to K. F. and EY 00053 to N. D.
References

Armstrong-James, M., and Fox, K. 1983. Effects of ionophoresed noradrenaline on the spontaneous activity of neurones in rat primary somatosensory cortex. J. Physiol. 335, 427-447.
Ascher, P., and Nowak, L. 1988a. Quisqualate and kainate-activated channels in mouse central neurones in culture. J. Physiol. 399, 227-246.
Ascher, P., and Nowak, L. 1988b. The role of divalent cations in the N-methyl-D-aspartate responses of mouse central neurones in culture. J. Physiol. 399, 247-266.
Ascher, P., Bregestovski, P., and Nowak, L. 1988. N-Methyl-D-aspartate activated channels of mouse central neurones in magnesium-free solutions. J. Physiol. 399, 207-226.
Baylor, D. A., and Fuortes, M. G. F. 1970. Electrical responses of single cones in the retina of the turtle. J. Physiol. 207, 77-92.
Bekkers, J. M., and Stevens, C. F. 1989. NMDA and non-NMDA receptors are co-localized at individual excitatory synapses in cultured rat hippocampus. Nature (London) 341, 230-233.
Blanton, M. G., LoTurco, J. J., and Kriegstein, A. R. 1990. Endogenous neurotransmitter activates N-methyl-D-aspartate receptors on differentiating neurons in embryonic cortex. Proc. Natl. Acad. Sci. U.S.A. 87, 8027-8030.
Bliss, T. V. P., and Lomo, T. 1973. Long-lasting potentiation of synaptic transmission in the dentate area of the anesthetized rabbit following stimulation of the perforant path. J. Physiol. 232, 331-356.
Brown, T. H., Chang, V. C., Ganong, A. H., Keenan, C. L., and Kelso, S. R. 1988. Biophysical properties of dendrites and spines that may control the induction and expression of long-term synaptic potentiation. In Long-Term Potentiation: From Biophysics to Behavior, P. W. Landfield and S. A. Deadwyler, eds., pp. 201-264. Alan R. Liss, New York.
Collingridge, G. L., and Bliss, T. V. P. 1987. NMDA receptors: Their role in long-term potentiation. Trends Neurosci. 10, 288-293.
Colquhoun, D. 1973. The relation between classical and cooperative models for drug action. In Drug Receptors, H. P. Rang, ed., pp. 147-182. Macmillan, New York.
Connors, B. W., Malenka, R. C., and Silva, L. R. 1988. Two inhibitory postsynaptic potentials, and GABA-A and GABA-B receptor-mediated responses in neocortex of rat and cat. J. Physiol. 406, 443-468.
Daw, N. W., and Fox, K. 1990. How NMDA and non-NMDA receptors contribute to responses in visual cortex. Soc. Neurosci. Abstr. 16, 1185.
Eccles, J. C. 1964. The Physiology of Synapses. Springer-Verlag, Berlin.
Flatman, J. A., Schwindt, P. C., and Crill, W. E. 1986. The induction and modification of voltage-sensitive responses in cat neocortical neurons by N-methyl-D-aspartate. Brain Res. 363, 62-77.
Fox, K., Sato, H., and Daw, N. W. 1989. The location and function of NMDA receptors in cat and kitten visual cortex. J. Neurosci. 9, 2443-2454.
Fox, K., Sato, H., and Daw, N. W. 1990. The effect of varying stimulus intensity on NMDA-receptor activity in cat visual cortex. J. Neurophysiol. 64, 1413-1428.
Galletti, C., Maioli, M. G., Squatrito, S., and Sanseverino, E. R. 1979. Single unit responses to visual stimuli in cat cortical areas 17 and 18. Arch. Ital. Biol. 117, 208-230.
Gamble, E., and Koch, C. 1987. The dynamics of free calcium in dendritic spines in response to repetitive synaptic input. Science 236, 1311-1315.
Granit, R., Kernell, D., and Lamarre, Y. 1966. Algebraic summation in the synaptic activation of motoneurones firing within the primary range to injected currents. J. Physiol. 187, 379-399.
Greenamyre, J. T., Olson, J. M. M., Penney, J. B., and Young, A. B. 1985. Autoradiographic characterization of N-methyl-D-aspartate, quisqualate and kainate-sensitive glutamate binding sites. J. Pharmacol. Exp. Ther. 233, 254-263.
Gustafsson, B., and Wigstrom, H. 1988. Physiological mechanisms underlying long-term potentiation. Trends Neurosci. 11, 156-162.
Harrison, N. L., and Simmonds, M. A. 1985. Quantitative studies on some antagonists of N-methyl-D-aspartate in slices of rat cerebral cortex. Br. J. Pharmacol. 84, 381-391.
Holmes, W. R., and Levy, W. B. 1990. Insights into associative long-term potentiation from computational models of NMDA receptor mediated calcium influx and intracellular calcium concentration changes. J. Neurophysiol. 63, 1148-1168.
Jahr, C. E., and Stevens, C. F. 1990. A quantitative description of NMDA receptor channel kinetic behavior. J. Neurosci. 10, 1830-1837.
Javitt, D. C., Frusciante, J. M., and Zukin, S. R. 1990. Rat brain N-methyl-D-aspartate receptors require multiple molecules of agonist for activation. Mol. Pharmacol. 37, 603-607.
Koch, C., and Poggio, T. 1983. A theoretical analysis of the electrical properties of spines. Proc. R. Soc. London Ser. B 218, 455-477.
Larkman, A., Stratford, K., and Jack, J. 1991. Quantal analysis of excitatory synaptic action and depression in hippocampal slices. Nature (London) 350, 344-347.
Lester, R. A. J., Clements, J. D., Westbrook, G. L., and Jahr, C. E. 1990. Channel kinetics determine the time course of NMDA receptor mediated synaptic currents. Nature (London) 346, 565-567.
Lisman, J. E. 1985. A mechanism for memory storage insensitive to molecular turnover: A bistable autophosphorylating kinase. Proc. Natl. Acad. Sci. U.S.A. 82, 3055-3057.
Malenka, R. C., Kauer, J. A., Zucker, R. S., and Nicoll, R. A. 1988. Postsynaptic
calcium is sufficient for potentiation of hippocampal synaptic transmission. Science 242, 81-84.
Martin, K. A. C. 1989. From single cells to simple circuits in the cerebral cortex. Q. J. Exp. Physiol. 73, 637-702.
Mayer, M. L., and Westbrook, G. L. 1987. The physiology of excitatory amino acids in the vertebrate central nervous system. Prog. Neurobiol. 28, 197-276.
McDonald, J. W., and Johnston, M. V. 1990. Physiological and pathophysiological role of excitatory amino acids during central nervous system development. Brain Res. Rev. 15, 41-70.
Naka, K. I., and Rushton, W. A. 1966. S-potentials from colour units in the retina of fish (Cyprinidae). J. Physiol. 185, 536-555.
Ohzawa, I., and Freeman, R. D. 1986. The binocular organisation of simple cells in the cat's visual cortex. J. Neurophysiol. 56, 221-242.
Patneau, D. K., and Mayer, M. L. 1990. Structure-activity relationships for amino acid transmitter candidates acting at N-methyl-D-aspartate and quisqualate receptors. J. Neurosci. 10, 2385-2399.
Sah, P., Hestrin, S., and Nicoll, R. A. 1989. Tonic activation of NMDA receptors by ambient glutamate enhances excitability of neurons. Science 246, 815-818.
Shepherd, G. M., and Brayton, R. K. 1979. Computer simulation of a dendrodendritic synaptic circuit for self and lateral inhibition in the olfactory bulb. Brain Res. 175, 377-382.
Sillito, A. M. 1987. Functional considerations of the operation of GABAergic inhibitory processes in the visual cortex. In Cerebral Cortex, E. G. Jones and A. Peters, eds., pp. 91-117. Plenum, New York.
Stone, T. W., and Burton, N. R. 1988. NMDA receptors and ligands in the vertebrate CNS. Prog. Neurobiol. 30, 333-368.
Torre, V., and Poggio, T. 1978. A synaptic mechanism possibly underlying directional selectivity to motion. Proc. R. Soc. London Ser. B 202, 409-416.
Trussell, L. O., Thio, L. L., Zorumski, C. F., and Fischbach, G. D. 1988. Rapid desensitization of glutamate receptors in vertebrate central neurons. Proc. Natl. Acad. Sci. U.S.A. 85, 2834-2838.
Tsumoto, T. 1990. Excitatory amino acid transmitters and their receptors in neural circuits of the cerebral neocortex. Neurosci. Res. 9, 79-102.
Wilson, C. J. 1984. Passive cable properties of dendritic spines and spiny neurons. J. Neurosci. 4, 281-297.
Yamada, K. A., Dubinsky, J. M., and Rothman, S. M. 1989. Quantitative physiological characterization of a quinoxalinedione non-NMDA receptor antagonist. J. Neurosci. 9, 3230-3236.
Zador, A. M., Koch, C., and Brown, T. H. 1989. Biophysical model of a Hebbian synapse. Proceedings of the International Joint Conference on Neural Networks, Washington, DC.

Received 16 May 1991; accepted 29 July 1991.
Communicated by Larry Abbott
Alternating and Synchronous Rhythms in Reciprocally Inhibitory Model Neurons

Xiao-Jing Wang*
John Rinzel
Mathematical Research Branch, NIDDK, Bldg. 32, Rm. 4B-54, National Institutes of Health, Bethesda, MD 20892 USA
We study pacemaker rhythms generated by two nonoscillatory model cells that are coupled by inhibitory synapses. A minimal ionic model that exhibits postinhibitory rebound (PIR) is presented. When the postsynaptic conductance depends instantaneously on presynaptic potential, the classical alternating rhythm is obtained. Using phase-plane analysis we identify two underlying mechanisms, "release" and "escape," for the out-of-phase oscillation. When the postsynaptic conductance is not instantaneous but decays slowly, the two cells can oscillate synchronously with no phase difference. In each case, different stable activity patterns can coexist over a substantial parameter range.

1 Introduction
Of long-standing interest are questions about rhythm generation in networks of nonoscillatory neurons, where the driving force is not provided by endogenous pacemaking cells. A simple mechanism for this is based on reciprocal inhibition between neurons, provided that they exhibit the property of postinhibitory rebound (PIR) (Perkel and Mulloney 1974). This mechanism has been found experimentally to play a role in the oscillatory behavior of some central pattern generators (CPGs) (Satterlie 1985; Arbas and Calabrese 1987; for a review see Selverston and Moulins 1985). A similar possibility has been suggested in the context of thalamocortical spindling oscillations in mammals, where the pacemaker was hypothesized to arise in the reticular thalamic nucleus, which consists solely of interacting inhibitory cells (Steriade et al. 1990). Recently, ionic mechanisms for PIR have been elucidated in a few cases. For the interneurons that control the leech Hirudo medicinalis heartbeat, PIR is produced by a mixed Na+/K+ inward "sag" current (Angstadt and Calabrese 1989), while the PIR response of reticular thalamic neurons is due to a low-threshold T-type calcium current (Steriade et al. 1990). The sag

*Present address of Xiao-Jing Wang: Department of Mathematics, University of Chicago, 5734 University Avenue, Chicago, IL 60637 USA.
Neural Computation 4, 84-97 (1992)
© 1992 Massachusetts Institute of Technology
current is activated, and the T-type calcium current is deinactivated, by hyperpolarization; thus a rebound excitation can be produced as a result of synaptic inhibition. Given the dearth of analysis of biophysical models for multicellular rhythm generation, we formulate a simple ionic model for PIR, and analyze the activity of a pair of inhibitory neurons. We show that when the synaptic interaction is assumed instantaneous, the classical pattern of an alternating oscillation (Perkel and Mulloney 1974) occurs naturally in this system. Two distinct mechanisms, "release" and "escape," are described and analyzed using phase plane techniques from the theory of nonlinear dynamics. In the release case, inhibition is terminated presynaptically (as the active cell returns to rest after its rebound excitation); here, the oscillation period depends sensitively on the duration of synaptic input. In contrast, an escape event is initiated by a postsynaptic cell due to its intrinsic membrane properties: the slowly developing inward current that underlies PIR overcomes the postsynaptic hyperpolarizing current, so that an inhibited cell depolarizes on its own account. The oscillation period does not depend on the presynaptic cell's time course, and escape can happen even with nonphasic synaptic input. Finally, if the postsynaptic conductance decays slowly, then inhibition can outlast the rebound excitation. Consequently, two coupled cells can cycle through the same dynamic phases together; perfect phase synchrony can then be realized. In addition to these rhythmic behaviors we find various bistability phenomena. For example, the cell pair might oscillate, or be at a steady state with one cell inhibited and the other not.

2 A Minimal Ionic Model for PIR
Each model neuron possesses just two nonsynaptic ionic currents: a constant conductance leakage current, I_L, and a voltage-dependent inward current referred to as the PIR current, I_pir. The latter is derived from a quantitative model of the T-type calcium current in thalamic neurons (Wang et al. 1991); it activates rapidly and inactivates slowly. For a resting cell, I_pir is strongly inactivated, so that hyperpolarization of sufficient duration and amplitude is required to deinactivate I_pir, and thereby to produce a transient inward current and rebound excitation after removal of hyperpolarization. This current does not lead to a regenerative excitation from rest when a depolarizing input is applied. In this sense our model is specifically for the PIR property. In thalamic relay neurons, and in other cell types, the PIR response underlies a burst of rapid sodium action potentials. To retain tractability, spikes are not included in our simplified model, but presumably could be. Each neuron is assumed to be electrically compact, and has two dynamic variables: membrane potential V and inactivation h for I_pir; activation m is assumed instantaneous, so we set m = m∞(V). The model
equations for two identical cells coupled via inhibitory synapses are given by

C dV_i/dt = -g_pir m∞³(V_i) h_i (V_i - V_pir) - g_L (V_i - V_L) - g_syn s_ji (V_i - V_syn)     (2.1)

dh_i/dt = φ [h∞(V_i) - h_i] / τ_h(V_i)     (2.2)

where s_ji is the postsynaptic conductance (fraction of the maximum g_syn) in cell i due to activity in cell j. Except for simulations in the final section, we assume that s_ji is an instantaneous, sigmoid function of the presynaptic voltage with a threshold θ_syn, thus

s_ji = s∞(V_j) = 1/{1 + exp[-(V_j - θ_syn)/k_syn]}     (2.3)

Computations were done with k_syn = 2 and g_syn = 0.3 mS/cm². Reversal potentials have the values (in mV) V_pir = 120, V_L = -60, and V_syn = -80. With C = 1 µF/cm² and g_L = 0.1 mS/cm², the passive membrane time constant τ₀ is 10 msec. The factor φ scales the kinetics of h; φ = 3 unless stated otherwise. The voltage-dependent gating functions are m∞(V) = 1/{1 + exp[-(V + 65)/7.8]}, h∞(V) = 1/{1 + exp[(V + 81)/11]}, and τ_h(V) = h∞(V) exp[(V + 162.3)/17.8]. The time constant for h (τ_h) has a maximal value of 65 msec at about -70 mV. We note in passing that a sag current would have an expression similar to I_pir in equation 2.1, but with the factor m∞³(V) omitted and with some quantitative changes (e.g., much slower time constant for h and different reversal potential). Below we describe the dynamic behavior of the two-cell system and its dependence on the two remaining parameters: the maximum conductance of the PIR current, g_pir, which specifies an intrinsic membrane property, and the synaptic threshold, θ_syn, a coupling parameter. Numerical integrations were carried out using the software package "PHASEPLANE" (Ermentrout 1990). The Gear method was used, with the tolerance parameter set to 0.001 or 0.01. The "AUTO" program (Doedel 1981) was used for generating the bifurcation diagrams in Figure 2.
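As a quick check on equations 2.1-2.3, the following minimal Python sketch integrates the two-cell system by forward Euler in the release regime of Figure 1 (g_pir = 0.3 mS/cm², θ_syn = -44 mV). The step size and the symmetry-breaking initial conditions are illustrative assumptions not specified in the text; the computations reported here used the Gear method in PHASEPLANE.

import math

# Forward-Euler sketch of equations 2.1-2.3 (release regime of Figure 1).
C, g_L, V_L = 1.0, 0.1, -60.0              # uF/cm^2, mS/cm^2, mV
g_pir, V_pir = 0.3, 120.0                  # PIR current
g_syn, V_syn = 0.3, -80.0                  # inhibitory synapse
theta_syn, k_syn, phi = -44.0, 2.0, 3.0

def m_inf(v): return 1.0 / (1.0 + math.exp(-(v + 65.0) / 7.8))
def h_inf(v): return 1.0 / (1.0 + math.exp((v + 81.0) / 11.0))
def tau_h(v): return h_inf(v) * math.exp((v + 162.3) / 17.8)
def s_inf(v): return 1.0 / (1.0 + math.exp(-(v - theta_syn) / k_syn))

def derivs(v, h, v_pre):
    # Right-hand sides of equations 2.1-2.2 for one cell; the synaptic gating
    # is the instantaneous sigmoid of equation 2.3, driven by the partner.
    i_ion = (g_pir * m_inf(v)**3 * h * (v - V_pir) + g_L * (v - V_L)
             + g_syn * s_inf(v_pre) * (v - V_syn))
    return -i_ion / C, phi * (h_inf(v) - h) / tau_h(v)

dt = 0.05                                   # msec (assumed step size)
v1, h1, v2, h2 = -30.0, 0.10, -75.0, 0.50   # assumed asymmetric start
for step in range(int(1000.0 / dt)):
    dv1, dh1 = derivs(v1, h1, v2)
    dv2, dh2 = derivs(v2, h2, v1)
    v1, h1 = v1 + dt * dv1, h1 + dt * dh1
    v2, h2 = v2 + dt * dv2, h2 + dt * dh2
    if step % 500 == 0:                     # print every 25 msec
        print(f"t = {step * dt:7.1f}   V1 = {v1:7.2f}   V2 = {v2:7.2f}")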
3 Alternating Oscillation by the Release Mechanism
Figure 1a illustrates the classical alternating pattern obtained with our model in the parameter regime for "release." This out-of-phase oscillation proceeds as follows: an excited cell (#1) sends inhibitory input to its partner (cell #2), so that a PIR current is deinactivated in the latter. Following the excitation of cell #1, cell #2 is released from hyperpolarization and executes a rebound excitation; this PIR leads to the inhibition of cell #1 in turn. The process repeats itself periodically.
Figure 1: Two reciprocally inhibitory neurons establish an alternating rhythmic oscillation by the "release" mechanism. (a) Membrane potential time courses for the out-of-phase oscillation between the two cells. Numerical solution of equations 2.1-2.2 with g_pir = 0.3 mS/cm², θ_syn = -44 mV. A cell is released from inhibition when the membrane potential of its partner falls below the synaptic threshold, θ_syn (dashed horizontal line). (b) The V-h phase plane of a single neuron; V- and h-nullclines shown both in the absence and presence of an inhibitory synaptic current g_syn(V - V_syn). The steady state (filled circles) in each case is denoted by V_F and V_I, respectively. The arrow on the V-axis indicates the value of θ_syn. The trajectory of a neuron (closed curve with arrowheads) from the two-cell simulation shows that the alternating oscillation consists of repetitive switchings between the "free" single cell configuration (labeled a and b) and the "inhibited" one (labeled c and d).
To dissect this oscillatory behavior mathematically we consider a limiting situation with the parameter k_syn very small, so that the synaptic function s∞(V) is either zero or one except at the abrupt transition level V = θ_syn. Then, a neuron is either "free" or "inhibited" when the membrane potential of its partner is, respectively, below or above the synaptic threshold θ_syn. The oscillatory pattern in Figure 1a then consists of repetitive switchings between these two states that occur in precise alternation in the two cells. Each state is described by the equations for a single neuron, one with and the other without a synaptic current, g_syn(V - V_syn), of constant conductance. Since the single-neuron model has two variables, V and h, we may analyze it by phase-plane methods (Edelstein-Keshet 1988). The nullclines of V and h are the curves obtained by setting dV/dt = 0 and dh/dt = 0, respectively (in equations 2.1-2.2), which yield

h = -[g_L (V - V_L) + g_syn s (V - V_syn)] / [g_pir m∞³(V) (V - V_pir)],     h = h∞(V)     (3.1)

In the free state, the synaptic term is absent in the V-nullcline in equation 3.1. The two V-nullclines, with and without the synaptic term, are displayed in Figure 1b, together with the h-nullcline. An intersection point (filled circle) of the V- and h-nullclines corresponds to a time-independent steady state. We denote the steady-state membrane potential as V_F and V_I, for the free case and the inhibited one, respectively. In Figure 1a, with g_pir = 0.3 mS/cm², we have V_F = -45 mV, and V_I = -74 mV. Also plotted in Figure 1b is the V-h trajectory of one neuron, obtained by numerically integrating the full four-variable system, equations 2.1-2.2. From this phase-plane portrait we reinterpret the oscillation as follows. Referring to the labels on the trajectory and on the time course in Figure 1a, we see that phases a and b correspond to the free state while c and d correspond to the inhibited state. During phase a the neuron undergoes its rebound depolarization after release from inhibition. The cell reaches maximum depolarization when its trajectory crosses the "free" V-nullcline. Then V decreases toward its resting value V_F (phase b); a slight undershoot of h is seen late in b as the trajectory crosses the h-nullcline. During phase b, as V passes downward through θ_syn, its partner is released from inhibition. This phase ends forcibly when the cell becomes postsynaptic to its partner (whose membrane potential rises above θ_syn during its rebound excitation). At the time of this transition, one should imagine that the free V-nullcline disappears and is replaced by the inhibited V-nullcline. The current position of (V, h) (previously, resting) now lies in the region where dV/dt < 0, so the trajectory moves leftward toward the now-applicable inhibited V-nullcline (phase c). Deinactivation occurs as the trajectory moves upward along the V-nullcline (phase d). Finally, upon being released by its partner, the cell becomes free, the free V-nullcline reappears, and the trajectory makes a sharp rightward turn.
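The nullclines in equation 3.1 can be tabulated directly. Reusing the definitions of the previous sketch, the fragment below prints the free (s = 0) and inhibited (s = 1) V-nullclines together with the h-nullcline, a crude textual stand-in for the phase-plane plot of Figure 1b.

# V-nullcline of equation 3.1: solve dV/dt = 0 in equation 2.1 for h, with
# the synaptic gating frozen at s (0 = free cell, 1 = fully inhibited cell).
# A negative value means the nullcline has no point at that voltage.
def v_nullcline(v, s):
    num = g_L * (v - V_L) + g_syn * s * (v - V_syn)
    return -num / (g_pir * m_inf(v)**3 * (v - V_pir))

for v in range(-74, -30, 4):
    print(f"V = {v:4d}   free h = {v_nullcline(v, 0.0):8.4f}   "
          f"inhibited h = {v_nullcline(v, 1.0):8.4f}   h_inf = {h_inf(v):6.4f}")

Near V = -45 mV the free V-nullcline meets the h-nullcline (V_F), and near V = -74 mV the inhibited one does (V_I), in agreement with the values quoted above.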
Figure 2: Period of alternating oscillation versus the synaptic threshold θ_syn for g_pir = 0.3 and 1.0 mS/cm². Periods for Figures 1 and 3 correspond, respectively, to the values of the curves at θ_syn = -44 mV. Solid curve: stable; dotted curve: unstable. Beyond a turning point (where solid and dotted curves coalesce), stable oscillation ceases to exist. Periodic solutions emerge via Hopf bifurcation (Edelstein-Keshet 1988) at endpoints of curves (filled circles). In the case g_pir = 1.0 mS/cm², we have "release" for θ_syn > -35 mV (= V_F) and "escape," with full inhibition, to the left of the steep region. Notice that, in the release case, the period depends strongly on θ_syn, which controls duration of the synaptic current, while in the escape case the period is largely constant. [Note, if the cell were not released, (V, h) would continue up the left branch toward the stable steady state at V_I.]

The synaptic threshold θ_syn is an important control parameter in this release scenario, for two reasons. First, the release event and the oscillatory pattern of Figure 1 are possible only if θ_syn exceeds the resting membrane potential V_F of a free cell. The inequality θ_syn > V_F means that a neuron at rest is not depolarized enough to inhibit its partner. Moreover, since both cells can thus sit stably at the free resting state, this system is bistable, with the coexistence of a stable steady state and an oscillatory one. Second, θ_syn critically determines the fraction of time of a PIR excitation during which the depolarized neuron effectively sends inhibition to its partner and, thereby, it determines the period of the alternating oscillation. The oscillation period as a function of θ_syn is plotted in Figure 2. Consider first the solid portion of the curve. If θ_syn is too high, no oscillation occurs since the synaptic inhibition generated by a rebound excitation would be too brief to deinactivate the PIR current. The only maintained behavior for the two cells is to be both at rest: V_1 = V_2 = V_F. Moreover, as long as θ_syn > V_F, this resting state is expected to be stable. As θ_syn is
decreased, sustained pacemaking emerges with minimum period. This period is determined primarily by the time constant τ_h(V) of the inactivation gate, which sets the minimum duration of inhibition for a PIR response (Wang et al. 1991, 1992). The period then increases with decreasing θ_syn, and the oscillation disappears as θ_syn approaches V_F. This disappearance occurs either because the period diverges to infinity (Wang et al. 1992) or, as in Figure 2, because the stable periodic orbit coalesces with another coexistent (unstable) periodic orbit. This latter mechanism leads to a "turning point" (or tangent bifurcation) in a plot of oscillation period (or amplitude) versus parameter, such as in Figure 2. It is also how the stable oscillation first appeared for θ_syn equal to about -37 mV. For θ_syn below V_F, the membrane potential of a free cell would return to V_F after excitation but without passing below θ_syn. Hence, V_1 would remain at V_F and the free cell would permanently "enslave" the other cell (at V_2 = V_I). Obviously, there are two such asymmetric time-independent solutions, with the roles interchanged in the pair.

4 Alternating Oscillation by the Escape Mechanism
In the preceding case, a cell that is inhibited will remain so because, at V_I, the deinactivated PIR current (together with I_L) is offset by the inhibitory synaptic current. However, if g_pir is larger, the slowly deinactivating I_pir can overcome the hyperpolarizing synaptic current, and we call this escape in the presence of maintained inhibition by a free cell. An example of escape is obtained with the same parameter values as in the release case, except for g_pir, which is increased from 0.3 to 1.0 mS/cm² (Fig. 3a). We may distinguish the two cases by comparing their phase plane profiles (Fig. 3b and Fig. 1b). The increase of g_pir brings about important changes both for the free V-nullcline and the inhibited one. The resting membrane potential of a free neuron is now V_F = -35 mV, more positive than θ_syn (-44 mV), so that release becomes impossible. On the other hand, the V-nullcline of an inhibited neuron is lowered by larger g_pir (cf. equation 3.1). As a result, the steady state V_I is shifted onto the middle branch and is destabilized. The trajectory of an inhibited neuron now reaches larger values of h, along the left branch of the V-nullcline, thereby further deinactivating I_pir. The trajectory is constrained to remain leftward of this branch until it reaches the top of the hump, when it moves rapidly rightward, and the neuron escapes from the inhibition. Unlike the release case, here the switching event is controlled by the inhibited neuron rather than the free one. If switching happens rapidly, then the oscillation period is about twice the time needed for an inhibited neuron to ascend fully the left branch of its V-nullcline. Therefore, the period of oscillation is expected to be insensitive to the synaptic parameter θ_syn in the escape case.
Figure 3: Alternating oscillation by the "escape" mechanism, with g_pir = 1.0 mS/cm² in (a and b) and g_pir = 1.5 mS/cm² in (c and d); θ_syn = -44 mV in both cases. (a) Membrane potential versus time reveals higher peaks here, compared to Figure 1a, due to the increased g_pir. From t = 380 to 580 msec, cell #1 is voltage clamped at V_1 > θ_syn. Cell #2 receives constant inhibition and executes a self-sustained oscillation [cf. dashed curve in (b)]. (b) Phase-plane portrait with nullclines for "free" and "inhibited" cell. An inhibited neuron can escape from the hyperpolarized state because the left "hump" of its V-nullcline has no stable steady state; here, the steady state is unstable and surrounded by a stable periodic orbit (dashed closed curve). Continued next page.
For θ_syn values higher than V_F, release becomes possible again. Either release or escape may occur depending on which of the following two processes is faster: the fall of V from its peak to θ_syn for the free neuron, or the ascension along the left branch for the inhibited neuron. The period of oscillation versus θ_syn is plotted in Figure 2. The significant increase of the period for θ_syn near -45 mV is reminiscent of the release case. Note that, for θ_syn just lower than V_F = -36 mV, due to a not so small value of
Figure 3: Continued from previous page. (c,d) In this escape case, different from (a,b), the inhibited nullcline has a stable steady state on the right branch of the V-nullcline. This leads to bistability for the neuron pair, with a stable alternating oscillation and an asymmetric steady state (V_1 = -34.3 mV, h_1 = 0.0141, V_2 = -50.5 mV, h_2 = 0.0587).
k_syn, the hyperpolarization from the free cell does not achieve its maximal strength, so that the escape for the inhibited cell is easier and quicker. The idealized escape situation applies only when θ_syn ≤ -45 mV, where the period remains virtually constant. Of further interest in this escape case is that an inhibited neuron has a unique steady state, at V_I, and it is an unstable spiral. There exists a limit cycle around it (dashed curve in Fig. 3b); that is, a single neuron can be a pacemaker under a constant synaptic inhibitory drive. Thus, if a cell is transiently voltage-clamped to a depolarized level above θ_syn, its partner would undergo this self-sustained oscillation (Fig. 3a). Such
a protocol might be used experimentally to distinguish the release and escape mechanisms. For even higher values of gpir,the inhibited V-nullcline is further lowered, the steady state at VI is shifted near or onto the right branch and it may become stable again (Fig. 3d). Nevertheless, a transient hyperpolarization could force the trajectory to cross the left branch of the V-nullcline. The succeeding large amplitude PIR response could lead to inhibiting the partner, thereby initiating an escape so that an alternating oscillation might be established. In contrast to the previous escape case, here a single inhibited neuron could not oscillate. It is also readily seen from Fig. 3d that the oscillation usually coexists with a stable asymmetric steady state with V1 = VF and V, = VI; for this, we have Vl < Osyn < VF. Figure 3c shows a protocol to detect such a bistability: a transient pulse of hyperpolarization of 50 msec leads to a transition from an asymmetric steady state to a sustained oscillation. Such bistability has been discovered using a similar protocol in laboratory experiments on a subnetwork of the CPG for the leech heartbeat (Arbas and Calabrese 1987; compare their Fig. 7 to our Fig. 3c). This demonstration is a striking indication that the escape mechanism described here may be relevant to that system, which contains pairs of inhibitorily coupled cells possessing a sag current. To draw a closer correspondence between the leech heartbeat and our theoretical model, it should be investigated experimentally whether the oscillatory period in the leech heartbeat interneurons is sensitive to the synaptic parameters that control the duration of postsynaptic inhibitory potentials. One may question how the important prerequiste for the escape mechanism, namely that the resting potential of a free neuron is higher than the synaptic threshold, could be realized in real neuronal systems. The leech heartbeat interneurons exhibit a very slowly decaying plateau potential (Arbas and Calabrese 1987). Therefore, this plateau depolarization may contribute to maintaining a quasistationary potential level in a free cell that is higher than the synaptic threshold, at least during the phase just prior to the escape event of a contralateral partner.
5 Synchronization by a Slowly Decaying Synaptic Activation
In accord with common wisdom, we have shown that neurons of an inhibitory pair tend to oscillate out-of-phase. Indeed, when s,, depends instantaneously on V it would be impossible to imagine a pattern in which two cells were simultaneously inhibited. However, another possibility arises if the synaptic action displays a slow time course, so that inhibition can outlast the PIR event. We report here an example in which our two model cells can be brought into perfect synchrony by the effects of a slow synaptic decay. Assume now that the synaptic variables s,j obey
first-order kinetics, described by

ds_ji/dt = α s∞(V_j) (1 - s_ji) - k_s s_ji     (5.1)
Then, if k_s is sufficiently small, both cells would "feel" perpetually some average synaptic input. If, in addition, the PIR is strong, oscillation in two cells is possible. Oscillatory synaptic inputs around the average communicate phasic information between the two cells, which may allow them to synchronize. Such an in-phase oscillation is shown in Figure 4, together with two other coexisting attractors: an out-of-phase oscillatory state and an asymmetric steady state. A depolarizing pulse applied simultaneously to both cells induces a transition from the asymmetric steady state to the synchronized oscillatory state; whereas another, asymmetric perturbation (current pulse of same duration and same absolute intensity, but depolarizing to one and hyperpolarizing to the other) would lead to the out-of-phase oscillatory state, thus desynchronizing the system. The in-phase oscillation uncovered here seems suggestive for the reticular thalamic nucleus, where inhibitory cells interact with each other via GABAergic synapses that usually possess a slow component and where spindling oscillations are marked by a high degree of synchronization in the thalamocortical system (Steriade et al. 1990).
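A sketch of this variant is given below, assuming for equation 5.1 the two-rate first-order form written above; the onset rate alpha is an assumed value (the text fixes only the slow decay rate k_s = 0.005/msec), the inactivation values at the start are assumed, and the remaining parameters are taken from the Figure 4 caption. The gating functions and constants of the earlier integrator sketch are reused.

# Figure 4 regime: slow synaptic decay. Parameter values from the caption.
g_pir, g_syn, g_L = 0.5, 0.2, 0.05
theta_syn, phi = -35.0, 2.0
alpha, k_s = 1.0, 0.005             # onset rate assumed; decay rate from text

def step_cell(v, h, s_in, dt):
    # One Euler step of equations 2.1-2.2 with synaptic gating s_in.
    i_ion = (g_pir * m_inf(v)**3 * h * (v - V_pir) + g_L * (v - V_L)
             + g_syn * s_in * (v - V_syn))
    return v - dt * i_ion / C, h + dt * phi * (h_inf(v) - h) / tau_h(v)

def step_syn(s, v_pre, dt):
    # One Euler step of equation 5.1: fast rise while the presynaptic cell
    # sits above theta_syn, slow decay (rate k_s) otherwise.
    return s + dt * (alpha * s_inf(v_pre) * (1.0 - s) - k_s * s)

# Start near the asymmetric steady state of Figure 4 (cell 1 free, cell 2
# inhibited; h values assumed); the caption's current pulses could be added
# to i_ion to switch the pair between its coexisting patterns.
v1, h1, v2, h2, s12, s21, dt = -37.0, 0.02, -72.0, 0.30, 1.0, 0.0, 0.05
for step in range(int(2000.0 / dt)):
    v1n, h1n = step_cell(v1, h1, s21, dt)   # cell 1 feels s21 (from cell 2)
    v2n, h2n = step_cell(v2, h2, s12, dt)   # cell 2 feels s12 (from cell 1)
    s12, s21 = step_syn(s12, v1, dt), step_syn(s21, v2, dt)
    v1, h1, v2, h2 = v1n, h1n, v2n, h2n
print(f"final: V1 = {v1:7.2f} mV, V2 = {v2:7.2f} mV")

6 Discussion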
We have explored, via simulation and analysis, the activity patterns supported by biophysically meaningful model cells which exhibit PIR and which are coupled by reciprocal inhibition. Two cells (each nonautorhythmic) can generate oscillatory patterns with the cells either out-of-phase or, surprisingly, in-phase. The former, the classical alternating pattern, arises ubiquitously when the postsynaptic variable, s_ji, depends instantaneously on presynaptic potential, V_pre. In-phase rhythms can occur when s_ji decays slowly after a transient depolarization by V_pre. In either case, these behaviors are not unique; in some parameter regimes two, or more, stable activity patterns coexist. Our simplified ionic model for a cell has only two membrane variables. By applying phase plane techniques, when s_ji is an instantaneous and steep sigmoidal function of V_pre, we find that two distinct mechanisms, "release" or "escape," underlie the alternating oscillation. In the first case, but not the second, the oscillation period depends sensitively on the duration of synaptic hyperpolarization. In "escape," a cell's intrinsic properties allow rebound excitation [and, in some parameter regimes (cf. Fig. 3a,b), sustained oscillation] to proceed even under maintained synaptic inhibition. In either case bistability may occur, where the oscillation coexists with a stationary pattern of both cells resting, or one
Figure 4: The two-cell system, equations 2.1-2.2, can oscillate in-phase when the postsynaptic conductance obeys first-order kinetics (equation 5.1) with slow decay rate. Cells are started in a stable asymmetric steady state with V_1 = -37 mV, V_2 = -72 mV. Compare the membrane potential and synaptic activation time courses to see that cell #1 is free (s_21 = 0) and cell #2 is inhibited (s_12 = 1). The cells are switched into a synchronous oscillation when, at t = 300 msec, a depolarizing current pulse of intensity 1.0 µA/cm² and of duration 50 msec is delivered to each cell. The time courses of s_ji during this phase exhibit rapid onset but slow decay of synaptic activation. Then, at t = 1100 msec, a depolarizing current pulse to cell #1 (intensity, 1.0 µA/cm²; duration, 50 msec) and a hyperpolarizing one to cell #2 (intensity, -1.0 µA/cm²; duration, 50 msec) send the neuronal pair into an out-of-phase periodic pattern. This system thus has at least three coexistent maintained activity patterns for these parameter values, which differ from those of previous figures as follows: g_pir = 0.5 mS/cm², g_syn = 0.2 mS/cm², g_L = 0.05 mS/cm², φ = 2, θ_syn = -35 mV, and k_s = 0.005.
resting and one inhibited, respectively. Our results suggest that the "release" and "escape" cases may perhaps be distinguished experimentally by selectively modulating parameters that control synaptic activation, particularly synaptic duration, or those that control the inward current that is unmasked by hyperpolarization and that underlies PIR. In an early modeling study (Reiss 1962), stable generation of an alternating rhythm relied on the fatigue of synapses. Later (Perkel and Mulloney 1974), PIR was proposed as an alternative mechanism, originating as an intrinsic cellular rather than a coupling property. The ionic basis of the PIR, however, was not identified and modeled in their study. Here, we have shown explicitly that inactivation of I_pir can play such a role. Either release occurs naturally as V_pre falls toward rest after its rebound excitation, as I_pir inactivates; or the inhibited cell escapes on its own, as deinactivation allows I_pir to overcome the synaptic current. We note that rebound excitation could also result from deactivation of an outward current. Although "release" would still be possible, it appears that for "escape" to occur an additional factor, beyond such an outward current, would be necessary. We have presented results only for two coupled inhibitory cells, but our interest extends to larger ensembles, for example, in connection with the reticular thalamic nucleus as a pacemaker for thalamocortical bursting oscillations. Our preliminary simulations have shown that slow decay of synaptic actions can lead to total synchronization in larger networks where inhibition is widespread (all-to-all coupling).
Acknowledgments

We thank Dr. Arthur Sherman for a careful reading of our manuscript.
References

Angstadt, J. D., and Calabrese, R. L. 1989. A hyperpolarization-activated inward current in heart interneurons of the medicinal leech. J. Neurosci. 9, 2846-2857.
Arbas, E. A., and Calabrese, R. L. 1987. Slow oscillations of membrane potential in interneurons that control heartbeat in the medicinal leech. J. Neurosci. 7, 3953-3960.
Doedel, E. 1981. AUTO: A program for the automatic bifurcation analysis of autonomous systems. Cong. Num. 30, 265-284.
Edelstein-Keshet, L. 1988. Mathematical Models in Biology. Random House, New York.
Ermentrout, G. B. 1990. PHASEPLANE: The Dynamical Systems Tool, Version 3.0. Brooks/Cole Publishing Co., Pacific Grove, CA.
Perkel, D. H., and Mulloney, B. 1974. Motor pattern production in reciprocally inhibitory neurons exhibiting postinhibitory rebound. Science 185, 181-183.
Reiss, R. F. 1962. A theory and simulation of rhythmic behavior due to reciprocal inhibition in small nerve nets. Proc. AFIPS Spring Joint Comput. Conf. 21, 171-194.
Satterlie, R. A. 1985. Reciprocal inhibition and postinhibitory rebound produce reverberation in a locomotor pattern generator. Science 229, 402-404.
Selverston, A. I., and Moulins, M. 1985. Oscillatory neural networks. Annu. Rev. Physiol. 47, 29-48.
Steriade, M., Jones, E. G., and Llinás, R. R. 1990. Thalamic Oscillations and Signaling. John Wiley, New York.
Wang, X.-J., Rinzel, J., and Rogawski, M. A. 1991. A model of the T-type calcium current and the low-threshold spikes in thalamic neurons. J. Neurophysiol. 66, 839-850.
Wang, X.-J., Rinzel, J., and Rogawski, M. A. 1992. Low threshold spikes and rhythmic oscillations in thalamic neurons. In Analysis and Modeling of Neural Systems, F. Eeckman, ed., pp. 85-92. Kluwer, Boston.
Received 8 July 1991; accepted 12 August 1991.
Communicated by Geoffrey Hinton
Feature Extraction Using an Unsupervised Neural Network

Nathan Intrator
Center for Neural Science, Brown University, Providence, RI 02912 USA
A novel unsupervised neural network for dimensionality reduction that seeks directions emphasizing multimodality is presented, and its connection to exploratory projection pursuit methods is discussed. This leads to a new statistical insight into the synaptic modification equations governing learning in Bienenstock, Cooper, and Munro (BCM) neurons (1982). The importance of a dimensionality reduction principle based solely on distinguishing features is demonstrated using a phoneme recognition experiment. The extracted features are compared with features extracted using a backpropagation network.
1 Introduction When a classification of high-dimensional vectors is sought, the curse of dimensionality (Bellman 1961) becomes the main factor affecting the classification performance. The curse of dimensionality is due to the inherent sparsity of high-dimensional spaces, implying that, in the absence of simplifying assumptions, the amount of training data needed to get reasonably low variance estimators is ridiculously high. This has led many researchers in recent years to construct methods that specifically avoid this problem (see Geman et al. 1991 for a review in the context of neural networks). One approach is to assume that the important structure in the data actually lies in a much smaller dimensional space, and therefore to try to reduce the dimensionality before attempting the classification. This approach can be successful if the dimensionality reduction/feature extraction method loses as little relevant information as possible in the transformation from the high-dimensional space to the low-dimensional one. Performing supervised feature extraction using the class labels is sensitive to the dimensionality in a manner similar to a high-dimensional classifier, and may result in a strong bias toward the training data, leading to poor generalization properties of the resulting classifier (Barron and Barron 1988). A general class of unsupervised dimensionality reduction methods, called exploratory projection pursuit, is based on seeking interesting projections of high-dimensional data points (Kruskal 1972; Friedman and
Neural Computation 4, 98-107 (1992)
@ 1992 Massachusetts Institute of Technology
Tukey 1974; Friedman 1987; Huber 1985, for review). The notion of interesting projections is motivated by an observation made by Diaconis and Freedman (1984) that for most high-dimensional clouds, most low-dimensional projections are approximately normal. This finding suggests that the important information in the data is conveyed in those directions whose single-dimensional projected distribution is far from gaussian. Various projection indices differ in their assumptions about the nature of the deviation from normality, and in their computational efficiency. Friedman (1987) argues that the most computationally efficient measures are based on polynomial moments. However, polynomial moments heavily emphasize departure from normality in the tails of the distribution (Huber 1985). Moreover, although many synaptic plasticity models are based on second-order statistics and lead to extraction of the principal components (Sejnowski 1977; von der Malsburg 1973; Oja 1982; Miller 1988; Linsker 1988), second-order polynomials are not sufficient to characterize the important features of a distribution (see examples in Duda and Hart 1973, p. 212). This suggests that in order to use polynomials for measuring deviation from normality, higher order polynomials are required, and care should be taken to avoid their oversensitivity to outliers. In this paper, the observation that high-dimensional clusters translate to multimodal low-dimensional projections is used to define a measure of multimodality for seeking interesting projections. In some special cases, where the data are known in advance to be bimodal, it is relatively straightforward to define a good projection index (Hinton and Nowlan 1990); however, when the structure is not known in advance, defining a general multimodal measure of the projected data is not straightforward, and will be discussed in this paper. There are cases in which it is desirable to make the projection index invariant under certain transformations, and maybe even remove second-order structure (see Huber 1985 for desirable invariant properties of projection indices). In those cases it is possible to make such transformations beforehand (Friedman 1987), and then assume that the data possess these invariant properties.
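To see the Diaconis and Freedman effect numerically, the following short simulation (our illustration, not part of the original paper; all names and parameter choices are ours) compares a random projection of a two-cluster, 100-dimensional cloud with the projection along the cluster axis, using excess kurtosis as a crude normality check:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 100))   # isotropic gaussian cloud
    X[:500, 0] -= 3.0                  # two clusters, separated only
    X[500:, 0] += 3.0                  # along the first coordinate

    def excess_kurtosis(p):
        p = p - p.mean()
        return (p**4).mean() / (p**2).mean()**2 - 3.0

    random_dir = rng.normal(size=100)
    random_dir /= np.linalg.norm(random_dir)
    cluster_axis = np.zeros(100)
    cluster_axis[0] = 1.0

    print(excess_kurtosis(X @ random_dir))    # near 0: looks gaussian
    print(excess_kurtosis(X @ cluster_axis))  # strongly negative: bimodal

The random projection typically shows excess kurtosis near zero (approximately gaussian), while the cluster axis gives a markedly negative value, the signature of a bimodal projected distribution.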
2 Feature Extraction Using ANN

In this section, the intuitive idea presented above is used to form a statistically plausible objective function whose minimization will find those projections having a single-dimensional projected distribution that is far from gaussian. This is done using a loss function whose expected value leads to the desired projection index. Mathematical details are given in Intrator (1990). Before presenting our version of the loss function, we review some necessary notation and assumptions. Consider a neuron with input vector $x = (x_1, \ldots, x_N)$ and synaptic weight vector $m = (m_1, \ldots, m_N)$, both in
$\mathbb{R}^N$, and activity (in the linear region) $c = x \cdot m$. Define the threshold $\Theta_m = E[(x \cdot m)^2]$ and the functions $\hat{\phi}(c, \Theta_m) = c^2 - \frac{2}{3} c \Theta_m$ and $\phi(c, \Theta_m) = c^2 - \frac{4}{3} c \Theta_m$. The $\phi$ function has been suggested as a biologically plausible synaptic modification function to explain visual cortical plasticity (Bienenstock et al. 1982). $\Theta_m$ is a dynamic threshold that will be shown later to have an effect on the sign of the synaptic modification. The input $x$, which is a stochastic process, is assumed to be of Type II $\varphi$-mixing,¹ bounded, and piecewise constant. These assumptions are plausible, since they represent the closest continuous approximation to the usual training algorithms, in which training patterns are presented at random. The $\varphi$-mixing property allows for some time dependency in the presentation of the training patterns. These assumptions are needed for the approximation of the resulting deterministic gradient descent by a stochastic one (Intrator and Cooper 1991). For this reason we use a learning rate $\mu$ that has to decay in time so that this approximation is valid. We want to base the projection index on polynomial moments of low order, and to use the fact that a projection that leads to a bimodal distribution is already interesting, and that any additional mode in the projected distribution should make the projection even more interesting. With this in mind, consider the following family of loss functions, which depends on the synaptic weight vector $m$ and on the input $x$:

$L_m(x) = -\frac{\mu}{3}\,(x \cdot m)^2 \left[ (x \cdot m) - \Theta_m \right]$
The motivation for this loss function can be seen in Figure 1, which represents the $\phi$ function and the associated loss function $L_m(c)$. For simplicity, the loss for a fixed threshold $\Theta_m$ and synaptic vector $m$ can be written as $L_m(c) = -(\mu/3)\, c^2 (c - \Theta_m)$, where $c = (x \cdot m)$. The graph of the loss function shows that for any fixed $m$ and $\Theta_m$, the loss is small for a given input $x$ when either $c = x \cdot m$ is close to zero or when $x \cdot m$ is larger than $\Theta_m$. Moreover, the loss function remains negative for $(x \cdot m) > \Theta_m$; therefore any kind of distribution on the right-hand side of $\Theta_m$ is possible, and the preferred ones are those that are concentrated farther from $\Theta_m$. It remains to be shown why a minimizer of the average loss cannot be such that all the mass of the distribution is concentrated on one side of $\Theta_m$. This cannot happen, because the threshold $\Theta_m$ is dynamic and depends on the projections in a nonlinear way, namely, $\Theta_m = E[(x \cdot m)^2]$. This implies that $\Theta_m$ will always move itself to a position such that the distribution will never be concentrated at only one of its sides. The risk (expected value of the loss) is given by

$R_m = -\frac{\mu}{3} \left\{ E\left[(x \cdot m)^3\right] - E^2\left[(x \cdot m)^2\right] \right\}$
¹The $\varphi$-mixing property specifies the dependency of the future of the process on its past.
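As an illustration of how this risk can be minimized in practice, here is a minimal sketch (our reconstruction, not the author's code; the function name, the learning-rate schedule, and the running-average estimate of $\Theta_m$ are our assumptions) of stochastic gradient descent on $R_m$ for a single linear neuron:

    import numpy as np

    def bcm_linear(X, n_steps=20000, mu0=0.05, seed=0):
        """X: (n_samples, N) matrix of input patterns x."""
        rng = np.random.default_rng(seed)
        n, N = X.shape
        m = rng.normal(scale=0.1, size=N)    # synaptic weight vector
        theta = 0.0                          # dynamic threshold, estimates E[(x.m)^2]
        for t in range(n_steps):
            x = X[rng.integers(n)]           # patterns presented at random
            c = x @ m                        # linear activity c = x.m
            mu = mu0 / (1.0 + 1e-3 * t)      # learning rate decaying in time (see text)
            phi = c * c - (4.0 / 3.0) * c * theta
            m += mu * phi * x                # stochastic step along -grad R_m
            theta += 0.01 * (c * c - theta)  # running average replaces E[(x.m)^2]
        return m

The single stochastic step follows the gradient of the risk in expectation, with the running average of $c^2$ standing in for the expectation that defines $\Theta_m$.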
Figure 1: The function $\phi$ and the loss function for a fixed $m$ and $\Theta_m$.

Since the risk is continuously differentiable, its minimization can be achieved via a gradient descent method with respect to $m$, namely

$-\nabla_m R_m = \mu\, E\left[ \phi(x \cdot m, \Theta_m)\, x \right]$
The resulting differential equations give a modified version of the law governing synaptic weight modification in the BCM theory for learning and memory (Bienenstock et al. 1982). This theory was presented to account for various experimental results in visual cortical plasticity. The modification lies in the way the threshold $\Theta_m$ is calculated. In the original form this threshold was $\Theta_m = E^p(c)$ for $p > 1$, while in the current form $\Theta_m = E(c^p)$ for $p > 1$. The latter takes into account the variance of the activity (for $p = 2$) and therefore is always positive; this ensures stability even when the average of the inputs is zero. The biological relevance of the theory has been extensively studied (Bear et al. 1987; Bear and Cooper 1988), and it has been shown that the theory is in agreement with the classical deprivation experiments (Clothiaux et al. 1991). The fact that the distribution has part of its mass on both sides of $\Theta_m$ makes this loss a plausible projection index that seeks multimodality. However, we still need to reduce the sensitivity of the projection index to outliers and, for full generality, allow any projected distribution to be shifted so that the part of the distribution that satisfies $c < \Theta_m$ will have its mode at zero. The oversensitivity to outliers is addressed by considering a nonlinear neuron in which the neuron's activity is defined to be $c = \sigma(x \cdot m)$, where $\sigma$ usually represents a smooth sigmoidal function. A more general definition that allows symmetry breaking of the projected distributions, provides a solution to the second problem raised above, and is still consistent with the statistical formulation is $c = \sigma(x \cdot m - \alpha)$, for an arbitrary threshold $\alpha$. The threshold $\alpha$ can be found by gradient descent as well. For the nonlinear neuron, $\Theta_m$ is defined to be $\Theta_m = E[\sigma^2(x \cdot m)]$. The loss function is given by

$L_m(x) = -\frac{\mu}{3}\,\sigma^2(x \cdot m) \left[ \sigma(x \cdot m) - \Theta_m \right]$
The gradient of the risk becomes

$-\nabla_m R_m = \mu\, E\left[ \phi\bigl(\sigma(x \cdot m), \Theta_m\bigr)\, \sigma'\, x \right]$

where $\sigma'$ represents the derivative of $\sigma$ at the point $(x \cdot m)$. Note that the multiplication by $\sigma'$ reduces the sensitivity of the differential equation to outliers, since for outliers $\sigma'$ is close to zero. The gradient descent is valid provided that the risk is bounded from below. Based on this formulation, a network of $Q$ identical nodes may be constructed. All the neurons in this network receive the same input and inhibit each other, so as to extract several features in parallel. The relation between this network and the network studied by Cooper and Scofield (1988) is discussed in Intrator and Cooper (1991). The activity of neuron $k$ in the network is defined as $c_k = \sigma(x \cdot m_k - \alpha_k)$, where $m_k$ is the synaptic weight vector of neuron $k$ and $\alpha_k$ is its threshold. The inhibited activity and threshold of the $k$th neuron are given by $\tilde{c}_k = c_k - \eta \sum_{j \neq k} c_j$ and $\tilde{\Theta}_m^k = E[\tilde{c}_k^2]$. A more general inhibitory pattern such as a Mexican hat is possible with minor changes in the mathematical details. We omit the derivation of the synaptic modification equations and present only the resulting stochastic modification equations for a synaptic vector $m_k$ in a lateral inhibition network of nonlinear neurons:
$\dot{m}_k = \mu \left[ \phi\bigl(\tilde{c}_k, \tilde{\Theta}_m^k\bigr)\, \sigma'(\tilde{c}_k) - \eta \sum_{j \neq k} \phi\bigl(\tilde{c}_j, \tilde{\Theta}_m^j\bigr)\, \sigma'(\tilde{c}_j) \right] x$
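A sketch of one stochastic update under these equations, for $Q$ laterally inhibiting nonlinear neurons, might look as follows (our reading of the equations above; the choice of tanh for $\sigma$, the constants, and the running threshold estimate are assumptions, not the author's implementation):

    import numpy as np

    def lateral_bcm_step(x, M, alpha, theta, mu=0.01, eta=0.2, lam=0.01):
        """x: input (N,); M: weights (Q, N); alpha, theta: (Q,) vectors."""
        sigma = np.tanh
        dsigma = lambda u: 1.0 - np.tanh(u) ** 2
        c = sigma(M @ x - alpha)                 # activities c_k
        cbar = c - eta * (c.sum() - c)           # c_k - eta * sum_{j != k} c_j
        phi = cbar**2 - (4.0 / 3.0) * cbar * theta
        g = phi * dsigma(cbar)                   # phi(cbar_k, theta_k) * sigma'
        dM = mu * (g - eta * (g.sum() - g))[:, None] * x[None, :]
        theta = theta + lam * (cbar**2 - theta)  # running estimate of E[cbar_k^2]
        return M + dM, theta

Each neuron follows its own gradient term minus the inhibitory contributions of the other neurons, so the $Q$ units are pushed toward different interesting projections.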
The lateral inhibition network performs a direct search for $Q$-dimensional projections in parallel, and therefore may find a richer structure that a stepwise approach would miss (see example 14.1 in Huber 1985).

3 Comparison with Other Feature Extraction Methods
The above feature extraction method has so far been applied to various high-dimensional classification problems: extracting rotation-invariant features from 3D wire-like objects (Intrator and Gold 1991) based on a set of sophisticated psychophysical experiments (Edelman and Bülthoff 1991), and feature extraction from the TIMIT speech data base using Lyon's cochlea model (Intrator and Tajchman 1991). The dimensionality of the feature extraction problem for these experiments was 3969 and 5500, respectively. It is surprising that a very moderate amount of training data was needed for extracting robust features, as will be shown below. In this section we briefly describe a linguistically motivated feature extraction experiment on stop consonants. We compare the classification performance of the proposed method to a network that performs
[Figure 2 block diagram: input data feeds an Unsupervised Feature Extraction network; its output feeds a Low-Dim Classifier, which produces the Labels.]
Figure 2: The low-dimensional classifier is trained on features extracted from the high-dimensional data. Training of the feature extraction network stops when the misclassification rate drops below a predetermined threshold on either the same training data (cross-validatory test) or on different testing data.

dimensionality reduction based on minimization of misclassification error (using backpropagation with an MSE criterion). In the latter we regard the hidden unit representation as a new, reduced feature representation of the input space. Classification in the new feature space was done using backpropagation.² The unsupervised feature extraction/classification method is presented in Figure 2. The pixel images corresponding to the speech data are shown in Figure 3. Similar approaches using the RCE and backpropagation networks have been carried out by Reilly et al. (1988). The following describes the linguistic motivation of the experiment. Consider the six stop consonants [p,k,t,b,g,d], which have been a subject of recent research in evaluating neural networks for phoneme recognition (see review in Lippmann 1989). According to phonetic feature theory, these stops possess several common features but only two distinguishing phonetic features, place of articulation and voicing (see Lieberman and Blumstein 1988 for a review and related references on phonetic feature theory). This theory suggests an experiment in which features extracted from unvoiced stops can be used to distinguish place of articulation in voiced stops as well. It is of interest whether these features can be found from a single speaker, how sensitive they are to voicing, and whether they are speaker invariant. The speech data consist of 20 consecutive time windows of 32 msec with 30 msec overlap, aligned to the beginning of the burst. In each time window, a set of 22 energy levels is computed.

²See Intrator (1990) for comparison with principal components feature extraction and with k-NN as a classifier.
Figure 3: An average of the six stop consonants followed by the vowel [a]. Their order from left to right: [pa] [ba] [ka] [ga] [ta] [da]. Time increases from the burst release on the X axis, and frequency increases on the Y axis. Brighter areas correspond to stronger energy.
These energy levels correspond to Zwicker critical band filters (Zwicker 1961). The consonant-vowel (CV) pairs were pronounced in isolation by native American speakers (two males, BSS and LTN, and one female, JES). Additional details on the biological motivation for the preprocessing, and on the linguistic motivation related to child language acquisition, can be found in Seebach (1990). An average (over 25 tokens) of the six stop consonants followed by the vowel [a] is presented in Figure 3. All the images are smoothed using a moving average. One can see some similarities between the voiced and unvoiced stops, especially in the upper left corner of the image (high frequencies, beginning of the burst), and the radical difference between them in the low frequencies. In the experiments reported here, five features were extracted from the 440-dimensional original space. Although the dimensionality reduction methods were trained only with the unvoiced tokens of a single speaker, the classifier was trained on (five-dimensional) voiced and unvoiced data from the other speakers as well. The classification results, which are summarized in Table 1, show that the backpropagation network does well in finding structure useful for classification of the trained data, but this structure is more sensitive to voicing. Classification results using a BCM network suggest that for this specific task, structure that is less sensitive to voicing can be extracted, even though voicing has significant effects on the speech signal itself. The results also suggest that these features are more speaker invariant. The difference in performance between the two feature extractors may be partially explained by looking at the synaptic weight vectors (images) extracted by both methods (Fig. 4): for the backpropagation feature extraction, it can be seen that although five units were used, fewer features were extracted. One of the main distinctions between the unvoiced stops in the training set is the high frequency burst at the beginning of the consonant (the upper left corner). The backpropagation method concentrated mainly on this feature, probably because it is sufficient to base the recognition of the training set on this feature, and because training
Table 1: Percentage of Correct Classification of Place of Articulation in Voiced and Unvoiced Stops.

                   B-P (%)    BCM (%)
    BSS /p,k,t/    100        100
    BSS /b,g,d/    83.4       94.7
    LTN /p,k,t/    95.6       97.7
    LTN /b,g,d/    78.3       93.2
    JES (both)     88.0       99.4
Figure 4: Synaptic weight images of the five hidden units of backpropagation (top), and the five BCM neurons (bottom).
stops when misclassification error falls to zero. On the other hand, the BCM method does not try to reduce the misclassification error, and it is able to find a richer, linguistically meaningful structure, containing burst locations and formant tracking of the three different stops, that allowed better generalization to other speakers and to voiced stops. The network and its training paradigm present a different approach to speaker-independent speech recognition. In this approach the speaker variability problem is addressed by training a network that concentrates mainly on the distinguishing features of a single speaker, as opposed to training a network on multispeaker data that must capture both the distinguishing and the common features.
Acknowledgments
I wish to thank Leon N. Cooper for suggesting the problem and for providing many helpful hints and insights. Geoff Hinton made invaluable comments. The application of BCM to speech is discussed in more detail in Seebach (1991) and in a forthcoming article (Seebach and Intrator, in press). Charles Bachmann assisted in running the backpropagation experiments. Research was supported by the National Science Foundation, the Army Research Office, and the Office of Naval Research.
References

Barron, A. R., and Barron, R. L. 1988. Statistical learning networks: A unifying view. In Computing Science and Statistics: Proc. 20th Symp. Interface, E. Wegman, ed., pp. 192-203. American Statistical Association, Washington, DC.
Bear, M. F., and Cooper, L. N. 1988. Molecular mechanisms for synaptic modification in the visual cortex: Interaction between theory and experiment. In Neuroscience and Connectionist Theory, M. Gluck and D. Rumelhart, eds., pp. 65-94. Lawrence Erlbaum, Hillsdale, NJ.
Bear, M. F., Cooper, L. N., and Ebner, F. F. 1987. A physiological basis for a theory of synapse modification. Science 237, 42-48.
Bellman, R. E. 1961. Adaptive Control Processes. Princeton University Press, Princeton, NJ.
Bienenstock, E. L., Cooper, L. N., and Munro, P. W. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.
Clothiaux, E. E., Cooper, L. N., and Bear, M. F. 1991. Synaptic plasticity in visual cortex: Comparison of theory with experiment. J. Neurophysiol. To appear.
Cooper, L. N., and Scofield, C. L. 1988. Mean-field theory of a neural network. Proc. Natl. Acad. Sci. U.S.A. 85, 1973-1977.
Diaconis, P., and Freedman, D. 1984. Asymptotics of graphical projection pursuit. Ann. Statist. 12, 793-815.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Edelman, S., and Bülthoff, H. H. 1991. Canonical views and the representation of novel three-dimensional objects. To appear.
Friedman, J. H. 1987. Exploratory projection pursuit. J. Amer. Statist. Assoc. 82, 249-266.
Friedman, J. H., and Tukey, J. W. 1974. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput. C-23, 881-889.
Geman, S., Bienenstock, E., and Doursat, R. 1991. Neural networks and the bias/variance dilemma. To appear.
Hinton, G. E., and Nowlan, S. J. 1990. The bootstrap Widrow-Hoff rule as a cluster-formation algorithm. Neural Comp. 2(3), 355-362.
Huber, P. J. 1985. Projection pursuit (with discussion). Ann. Statist. 13, 435-475.
Intrator, N. 1990. Feature extraction using an unsupervised neural network. In Proceedings of the 1990 Connectionist Models Summer School, D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, eds., pp. 310-318. Morgan Kaufmann, San Mateo, CA.
Intrator, N., and Cooper, L. N. 1991. Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks. To appear.
Intrator, N., and Gold, J. I. 1991. Three-dimensional object recognition of gray level images: The usefulness of distinguishing features. To appear.
Intrator, N., and Tajchman, G. 1991. Supervised and unsupervised feature extraction from a cochlear model for speech recognition. In Neural Networks for Signal Processing - Proceedings of the 1992 IEEE Workshop, B. H. Juang, S. Y. Kung, and C. A. Kamm, eds., pp. 460-469.
Kruskal, J. B. 1972. Linear transformation of multivariate data to reveal clustering. In Multidimensional Scaling: Theory and Application in the Behavioral Sciences, I, Theory, R. N. Shepard, A. K. Romney, and S. B. Nerlove, eds., pp. 179-191. Seminar Press, New York and London.
Lieberman, P., and Blumstein, S. E. 1988. Speech Physiology, Speech Perception, and Acoustic Phonetics. Cambridge University Press, Cambridge.
Linsker, R. 1988. Self-organization in a perceptual network. IEEE Computer 21, 105-117.
Lippmann, R. P. 1989. Review of neural networks for speech recognition. Neural Comp. 1, 1-38.
Miller, K. D. 1988. Correlation-based models of neural development. In Neuroscience and Connectionist Theory, M. Gluck and D. Rumelhart, eds., pp. 267-353. Lawrence Erlbaum, Hillsdale, NJ.
Oja, E. 1982. A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267-273.
Reilly, D. L., Scofield, C. L., Cooper, L. N., and Elbaum, C. 1988. Gensep: A multiple neural network with modifiable network topology. INNS Conf. Neural Networks.
Seebach, B. S. 1991. Evidence for the development of phonetic property detectors in a neural net without innate knowledge of linguistic structure. Ph.D. dissertation, Brown University.
Seebach, B. S., and Intrator, N. A neural net model of perinatal inductive acquisition of phonetic features. In press.
Sejnowski, T. J. 1977. Storing covariance with nonlinearly interacting neurons. J. Math. Biol. 4, 303-321.
von der Malsburg, C. 1973. Self-organization of orientation sensitivity cells in the striate cortex. Kybernetik 14, 85-100.
Zwicker, E. 1961. Subdivision of the audible frequency range into critical bands (Frequenzgruppen). J. Acoust. Soc. Am. 33(2), 248.
Received 18 March 1991; accepted 20 May 1991.
Communicated by Alex Waibel
Speaker-Independent Digit Recognition Using a Neural Network with Time-Delayed Connections K. P. Unnikrishnan* Molecular Biophysics Research Department, AT&T Bell Laboratories, Murray Hill, NJ 07974 USA
J. J. Hopfield Molecular Biophysics Research Department, AT&T Bell Laboratories, Murray Hill, NJ 07974 USA, and Divisions of Chemistry and Biology, California Institute of Technology, Pasadena, CA 91125 USA D. W. Tank Molecular Biophysics Research Department, AT&T Bell Laboratories, Murray Hill, NJ 07974 USA
The capability of a small neural network to perform speaker-independent recognition of spoken digits in connected speech has been investigated. The network uses time delays to organize rapidly changing outputs of symbol detectors over the time scale of a word. The network is data driven and unclocked. To achieve useful accuracy in a speaker-independent setting, many new ideas and procedures were developed. These include improving the feature detectors, self-recognition of word ends, reduction in network size, and dividing speakers into natural classes. Quantitative experiments based on Texas Instruments (TI) digit data bases are described.
1 Introduction

Accurate recognition of spoken words in connected speech is difficult to achieve with limited computational resources. A "neural network" approach using time delays to organize the incoming signal into a recognizable form was constructed in earlier work, and studied in detail for
*Present address: Computer Science Department, GM Research Laboratories, Warren, MI 48090-9055 USA.
Neural Computation 4, 108-119 (1992) @ 1992 Massachusetts Institute of Technology
the case of a single speaker (Unnikrishnan et al. 1988, 1991). The case of a single speaker is, however, notoriously easier than speaker-independent word recognition, and is of rather limited utility in the world of engineering compared to the case of speaker independence. The present paper studies time-delay networks for speaker-independent connected speech. The problem of identifying the spoken digits 0-9 in connected speech was chosen because it is well defined, small enough to study in detail, and has an established data base used as a standard for intercomparisons of results (Leonard 1984). This data base is sufficiently diverse that adequate performance on it is believed to be sufficient for field use in the United States. In addition, this particular problem is sufficiently important that a compact, low-cost, and low-power-consumption solution to it would be commercially useful. The multiple-speaker problem is much more difficult than the single-speaker case, and its adequate solution demands many additional ideas and methods not present in our previous studies. Based on how well the original network performed on a speaker-dependent data base, we set out to examine whether a small number of networks could be used in parallel to solve the more difficult speaker-independent problem. Each subnetwork would be optimized on a separate cluster of data, for example, males or females. Because it is simple to train networks that make few mistakes of erroneous recognition, parallel use of multiple networks is a feasible approach to the general problem. In the course of these studies we found that even when the data were clustered into a few simpler problems, recognition accuracy was inadequate. Changes were therefore made to improve network performance. The most important of these changes are improved front-end signal processing for more reliable generation of invariant features from the input speech, reduction in the size of the network to favor generalization over memorization in the learning process, using the network itself to recognize what to learn, automatic segmentation of spoken digits from multiword strings, and explorations of dividing speakers into natural classes to simplify the problem faced by a single network. In this paper we describe the performance of the various networks and approaches, presenting critical experiments for deciding to incorporate or abandon particular ideas and structures in the overall scheme. These results are described approximately in the order in which they were obtained. They begin with the obvious: using the same network that had proved successful for the single-speaker problem on the multiple-speaker data base. They conclude with experiments on a much-improved network and a data base of male speakers only (having found along the way that, like the simple hidden Markov model (HMM), a single network performs at only a moderate level when men and women are placed together in the data base). The size and complexity of the networks simulated are such that an analog CMOS implementation would require less than a square centimeter.
2 Network Architecture and Learning Algorithm

The conceptual design of the word-recognition neural network is sketched in Figure 1. Details of the architecture and the learning algorithm have been described elsewhere (Unnikrishnan et al. 1991), and here we give only a very brief summary.
Figure 1: Block diagram of the speech recognition circuit with 32 bandpass filters. The feature detectors use a "center-surround" mechanism and the responses are generated by comparing outputs of neighboring filters. In some of the experiments, a different front end was used. It consisted of 15 bandpass filters and a zero crossing detector. The zero crossing detector uses raw speech waveform. The time delays used in the network have gaussian profiles with different widths. The connection matrix is learned using a learning algorithm and the rest of the circuit is fixed.
The analog speech waveform first goes to a bank of parallel analog frequency band filters. The rectified output from this filter bank is sent to a feature detector network that identifies the presence of short-time-scale features by a competitive comparison of the outputs of the filter bank. This procedure converts the original single-channel analog signal to multiple-channel binary outputs crudely describing the frequency locations of three significant peaks in the power spectrum. These outputs change much more slowly in time than the input signal does, and provide a suitable pattern to be recognized by neural network approaches. The multiple-channel signal is sent down parallel tapped dispersive delay lines. The recognition unit for a particular word is connected by analog "weights" to these delay lines, and the outputs of these nonlinear units signify recognition of the completion of particular words. The weights in the network are learned using an algorithm that minimizes a mutual discrimination error measure (see Hopfield 1987 and Unnikrishnan et al. 1991 for details of the learning algorithm). By storing the delay information at discrete time intervals, the learning problem for a particular recognition unit can be reduced to a single layer of weights in an analog network with many inputs and a single output. The learning rule uses the analog response of the output units as a representation of the probability of a word having been said. For a given set of data, gradient descent leads to a unique set of weights for this learning rule and network structure. To compensate for temporal distortions, we have used dispersive time delays in the network (see also Tank and Hopfield 1987). These delays have gaussian profiles with different widths. In addition, each recognition unit also has an integration time constant. The summed signal from all delays is filtered with this time constant at the input of each recognition unit. Hence there are two parameters that determine the temporal tolerance of the network: (1) the width of the time delays (σ) and (2) the integration time constant of the recognition units (τ_int).
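To make the signal path concrete, here is a small discrete-time sketch (our illustration; the sampling step, the tap placement, and all function and parameter names are assumptions, not the authors' implementation) of a gaussian dispersive tap and of a recognition unit that sums weighted taps and then integrates leakily:

    import numpy as np

    def gaussian_tap(channel, delay, width, dt):
        """Channel signal seen through one dispersive tap: a gaussian
        smear centered `delay` seconds in the past."""
        t = np.arange(0.0, delay + 4.0 * width, dt)
        k = np.exp(-0.5 * ((t - delay) / width) ** 2)
        return np.convolve(channel, k / k.sum())[: len(channel)]

    def recognition_unit(channels, w, delays, widths, tau_int, dt=0.01):
        """channels: (C, T) binary feature outputs; w: (C, K) learned weights
        over K taps per channel; returns the unit's analog response."""
        C, T = channels.shape
        drive = np.zeros(T)
        for c in range(C):
            for d, s, wk in zip(delays, widths, w[c]):
                drive += wk * gaussian_tap(channels[c], d, s, dt)
        out = np.zeros(T)
        for t in range(1, T):   # leaky integration with time constant tau_int
            out[t] = out[t - 1] + (dt / tau_int) * (drive[t] - out[t - 1])
        return out

Widening the gaussian taps or lengthening τ_int both increase the tolerance of the unit to temporal distortion of a word, at the price of temporal resolution.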
3 Data Base and Scoring Protocols

All the results reported here are based on two spoken digit data bases from TI. The TI isolated digit data base used consists of two utterances of each digit by 112 speakers, a regionally balanced mixture containing both men and women. These files were divided into a training set containing 50 speakers and a test set of 62 speakers. There is an appreciable variance in the distribution of utterance lengths. The average fractional time distortion [(longest - shortest)/average] for the training data was about 92%, but individual cases were as high as 157%. The TI connected digit data base contains 330 speakers, a balanced mixture including children as well as men and women. There are two examples from each speaker of each individual digit, 11 examples from
each speaker of spoken digit pairs, and strings of up to 7 digits (Leonard 1984). The experimental results for any particular set of data, network structure, and connections can be described by the percentage of correctly identified digits. Two measures were used to evaluate performance and provide recognition accuracy scores. The first is the threshold score: according to this measure, a recognition is scored as correct only if the output of the correct recognition unit is above a threshold value near the end of the utterance, with all the incorrect recognition units remaining below this threshold throughout the utterance. The second performance measure is the area score: according to this measure, a recognition is scored as correct if the time-integrated output of the correct recognition unit over the period of the utterance is larger than the integrated output of any of the incorrect units. The threshold criterion for recognition is required for word spotting (recognition of individual words independent of the context in which they occur). The area criterion requires segmentation of the data for recognition, and is analogous to the scoring procedure used in HMM models (Levinson et al. 1983). We include recognition accuracies according to the area criterion for comparison with the results of other groups. Real-time recognition of words would be impossible with the area criterion without cumbersome algorithmic additions to an otherwise simple network structure. In the models that use such a criterion, the recognition is usually done by waiting until the end of the sequence and then doing multiple matches with respect to the number of words spoken and the possible candidates. The threshold criterion is a stricter measure of network performance than the area criterion, and hence in all cases the threshold score is lower. In the following text and table we give the recognition accuracy with the area criterion outside the parentheses and the accuracy with the threshold criterion within the parentheses.
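In code, the two criteria might be checked as follows (a sketch based on the definitions just given; the threshold value and the window taken as "near the end" are our assumptions):

    import numpy as np

    def score(outputs, correct, thresh=0.5, tail=5):
        """outputs: (T, W) analog responses of W recognition units over an
        utterance; correct: index of the word actually spoken.
        Returns (threshold_correct, area_correct)."""
        others = np.delete(np.arange(outputs.shape[1]), correct)
        threshold_ok = (outputs[-tail:, correct].max() > thresh
                        and outputs[:, others].max() < thresh)
        area_ok = (outputs[:, correct].sum()
                   > outputs[:, others].sum(axis=0).max())
        return threshold_ok, area_ok

The threshold test can fail even when the area test passes, which is why the threshold scores in Table 1 are uniformly the lower of the two.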
4 Results

It was previously demonstrated (Unnikrishnan et al. 1991) that on a single-speaker connected-digit data base, a network with learned time-delayed connections had a performance similar to that provided by HMM digit recognition systems. When trained on 288 utterances of the digits, the network was able to learn all of the training data (recognition accuracy of 100% with both the threshold and area criteria). It could recognize a test set of 144 utterances with an accuracy of 100% (99.3% with the threshold criterion). To evaluate the extent to which this same network and training algorithm could solve the speaker-independent isolated digit recognition problem, it was trained on 500 utterances from the TI isolated digit data base. This data base contains a mixture of males, females, and children. The recognition accuracy on the training set was 98.6% [(91.4%) with the threshold criterion; row a, Table 1].
Table 1: Recognition Accuracy of Training and Testing Data with Various Network Configurations and Data Sets.*

Row   Training data (%)   Testing data (%)   Comments
a     98.6 (91.4)         81.5 (61.8)        32 input channels, isolated digit data base, 50-speaker training set, 62-speaker test set
b     99.8 (98.4)         92.0 (78.0)        As in case (a) but learning with self-justification
c     99.6 (96.1)         -                  Learning on the combined sets of (a), 112 speakers
d     (98.5)              -                  As in (c) but with 15 frequency channels and the "unvoiced" channel
e     97.6 (90.0)         -                  Trained on 309 speakers, one example of each digit from each speaker
f     100 (99.5)          98.3 (92.6)        Train on one utterance of 110 males, test on other utterances of same males
g     99.8 (93.7)         -                  Trained on 2090 one-word segments from two-digit strings of 110 males
h     99.9 (95.2)         95.6 (81.1)        Train on 1056 segments from two-digit strings of 55 males marked as training set, tested on 1037 segments from other 55 males
i     99.6 (93.5)         97.5 (84.9)        As in (h), but adding an additional 544 segments from three-digit strings of the same speakers to the training data; test data are the same as in (h)
j     98.0 (83.3)         95.5 (75.4)        As in (i) but adding 1100 isolated word files of the same speakers to the training data
k     (92.6)              (82.4)             Recognition accuracy for male connected speech; trained on segments from the connected speech set, tested on strings from the test set

*Recognition accuracy using the area criterion is given outside the parentheses and accuracy using the threshold criterion within the parentheses. Rows a-d contain recognition results on the TI isolated-digit data base and rows e-k contain results on the TI connected-digit data base. All results in rows e-k use a front end with 15 frequency channels and an "unvoiced" channel. See text for more details.
It recognized an independent test set (different speakers) of 620 utterances with an accuracy of 81.5% [(61.8%); row a, Table 1]. These scores indicate that the circuitry and the learning paradigm as used in the single-speaker case were not sufficient for reliable recognition on the multiple-speaker data base.

4.1 Time-Duration Clustering. A series of experiments was done to determine the effects of temporal distortions on the network performance. In the first set of experiments, the data base was split into two clusters
(one containing the shorter utterances and the other containing the longer utterances), and separate networks were trained on each of them. In the next set of experiments, the time delays were made rigid. These changes did not alter the network performance drastically, suggesting that for this data base, most of the difficulty may be due to variance in the frequency domain.

4.2 Frequency Clustering. Using a network trained on the entire data base, the files were split into two clusters: one containing high frequency utterances and the other containing low frequency utterances. Networks were able to learn the utterances in these clusters to better accuracies than those from an unbiased group taken from the same total data set. Also, a network trained on one cluster recognized test sets from the other cluster very poorly. These results demonstrate that spectral variance in acoustic features contributes substantially to the limited performance of the speech recognition network. We therefore adopted the premise that any complete recognition system would have two separate networks devoted to different frequency clusters, and focused on improving the accuracy of a network for the male-speaker subset of the data base.

4.3 Self-Justification. The speech examples were end-pointed by hand for use in the supervised learning procedure. But an analysis of the network outputs after learning showed that for many of the examples, the maximum response of the correct recognition unit was not at the assigned end point. This suggested that to generate optimal networks, the output of the network itself should be used for determining the end points of words. To accomplish this, the network was partially trained with the hand-justified end points. The time point of maximum response of the correct recognition unit, with these partially trained connections, was taken as the new end point, and the training continued. This procedure implements self-justification of all the examples. This led to much better recognition of the training and testing data (compare row b with row a in Table 1), decreasing the error rate by about a factor of two on the independent test set.
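In code, the re-estimation step of self-justification amounts to one pass over the examples with the partially trained network (a sketch; `network.respond` and the data layout are hypothetical names of ours, not the authors' software):

    import numpy as np

    def self_justify(network, examples):
        """Replace each hand-marked end point by the time at which the
        correct recognition unit of the partially trained network responds
        most strongly; training then continues with the new end points."""
        new_end_points = []
        for utterance, word in examples:
            response = network.respond(utterance)[:, word]   # (T,) trace, hypothetical API
            new_end_points.append(int(np.argmax(response)))
        return new_end_points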
4.4 Data Limitations. The TI isolated word multiple-speaker data base was studied with the 32-channel front end described in Unnikrishnan et al. (1991). Row b in Table 1 shows the results for training sets and test sets of approximately equal size. The excellent recognition score of (98.7%) according to the stringent threshold criterion on the training set was not matched by the score of (78%) on the test set. This discrepancy indicates that the system is to some extent memorizing the voices of the particular 50 speakers in the training set, and that the ensemble used for training is not large enough.
To examine whether the network is in principle capable of correctly classifying all the data at once, it was trained on all the data, comprising all 112 speakers and 1120 speech files. Row c shows the results. By the area criterion, the classification was near perfect, and with the threshold criterion, the performance was at the 96% level. This result suggested that to be able to both train and test the system, more speakers and more data per connection would be necessary. It also suggested that the network was near the end of its capability, and that some improvements would prove necessary to reach high levels of performance on an adequately broad data set. The system requires too much data because it has too many connections to train. A typical word detector requires about 7 x 32 + 1 = 225 connections (on average, 7 time delays for each of the 32 input channels, plus a bias value). While many more speech files than this are available, the similarity between different speakers or files is such that the amount of training data available is not adequately large. To alleviate this problem, we reduced the network size to 16 channels, with typically 7 x 16 + 1 = 113 connections per digit.

4.5 Zero-Crossing Detectors. The original 32 frequency band "front end" followed by a feature detector network was designed to locate peaks in the power spectrum, and does not distinguish very well between vowel and consonant sounds or between voiced and unvoiced speech. A detector was designed to distinguish between voiced and unvoiced speech and used as one of the channels in a reduced 16-channel front end. A variety of methods can do this with high reliability. We chose a method based on zero crossings of the raw waveform as one that would be easy to implement in analog hardware and relatively independent of the intensity normalization of the sound. An impulse was generated at the time of each upward zero crossing of the raw speech signal. These impulses were filtered with an integration time constant of 0.005 sec. The unvoicing channel was turned on if the output of this filter corresponded to a zero-crossing rate above 2000 crossings per second, and if the total power in the speech was above a threshold level. The output of this channel located the unvoiced consonants x, s, f, v, and t in the data set with excellent reliability. Further explorations in this paper have all been based on a 16-channel system containing the zero-crossing detector. The other 15 channels are frequency channels of the previous type (see Unnikrishnan et al. 1991 for details), but having twice the frequency bandwidth. These channels were centered at the locations of the previous even-numbered channels 2-30. The feature detector network was modified slightly to prevent the identification of a peak in two adjacent frequency channels. The replacement of the 32-channel front end by the 16-channel system described above resulted in better performance on the entire 112-speaker data base (compare rows c and d in Table 1).
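The voicing channel is simple enough to state directly in code. The sketch below follows the description above (impulses at upward zero crossings, a 0.005 sec integration filter, a 2000 crossings-per-second threshold, and a power gate); the sampling rate, the power threshold value, and the discrete filter form are our assumptions:

    import numpy as np

    def unvoiced_channel(speech, fs=12500, tau=0.005, rate_thresh=2000.0,
                         power_thresh=1e-4):
        """speech: raw waveform samples (float array); returns a binary
        channel that is on where the zero-crossing rate exceeds rate_thresh
        and the signal power is above power_thresh."""
        # impulse at every upward zero crossing
        impulses = np.zeros(len(speech))
        impulses[1:] = (speech[:-1] < 0) & (speech[1:] >= 0)
        # leaky integration with time constant tau -> crossings/sec estimate
        rate = np.zeros(len(speech))
        power = np.zeros(len(speech))
        a = 1.0 / (tau * fs)
        for t in range(1, len(speech)):
            rate[t] = rate[t - 1] + a * (impulses[t] * fs - rate[t - 1])
            power[t] = power[t - 1] + a * (speech[t] ** 2 - power[t - 1])
        return (rate > rate_thresh) & (power > power_thresh)

Both the rate estimate and the power gate are insensitive to the overall gain of the recording, which is the property that made the zero-crossing approach attractive for analog hardware.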
Confronted with the necessity of obtaining more data, and desiring to move toward connected speech, we began working with the much larger TI data base for connected speech.

4.6 Separating Males and Females. The TI connected-digit data base available to us contains a regionally balanced set of 309 speakers, including men, women, and children. When trained on one utterance by each speaker of each of the 10 digits spoken as an isolated word (3090 files), a relatively poor performance level [97.6% (90.0%); row e, Table 1] on the training set was achieved. Clearly the speech variation is now greater than the network can encompass. One major difference between the present data base and the previous one is the inclusion of children. Following the partitioning idea described earlier, we split the data base into two portions, males and nonmales. For the male training set of isolated words from the connected-digit data base (consisting of one example of each digit spoken by all 110 speakers), the network could be trained to a high level of performance on the training set [100% (99.5%); row f, Table 1]. The poorer performance on the test set from the same speakers (row f, Table 1) indicates that there is still an inadequate number of speech files in the training set. However, more data were now available from strings of two and three digits.
4.7 Automatic Segmentation of Training Data. To obtain the individual words necessary for training from digit strings without a large amount of hand segmentation, a bootstrap procedure was employed. To begin, the recognition network with the connections learned from one utterance of each digit from the 110 male speakers was used to label the ends of words in the connected speech files. The recognition score was 100% [(99.5%); row f, Table 1] on the training set and 98.3% [(92.6%); row f, Table 1] on the test set. These connections were then used to segment individual digits from two-digit strings. The system could now be trained on this larger data base. By iteration, the total training set size was ultimately increased to 2090 utterances. This training set could be recognized with an accuracy of 99.8% [(93.7%); row g, Table 1]. Since the performance by the threshold criterion is lower than that obtained with the isolated-digits data base (row f, Table 1), we surmise that the recognition of digits segmented from strings is a harder problem than working with isolated words. The two obvious differences between this data base and the isolated word data base are the larger variation in the lengths of utterances and word-word coarticulation effects. This enlarged data set was split into training and test sets (55 speakers for training and 55 speakers for testing), yielding 1056 segmented words for training and 1037 words for testing. The network could be trained to recognize the training set with an accuracy of 99.9% [(95.2%); row h, Table 1]. The test set was recognized with an accuracy of 95.6% [(81.1%); row h, Table 1].
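The bootstrap loop used to grow these training sets can be summarized as follows (a sketch of the procedure as described; `network.train`, `network.locate_word_ends`, and `segment_at` are hypothetical helpers of ours, not the authors' code):

    def bootstrap_segmentation(network, isolated_digits, digit_strings,
                               segment_at, n_rounds=3):
        """Grow the training set by letting the current network locate word
        ends in connected strings, then retraining on the new segments."""
        training_set = list(isolated_digits)
        for _ in range(n_rounds):
            network.train(training_set)
            for string, labels in digit_strings:
                ends = network.locate_word_ends(string)   # hypothetical API
                training_set += segment_at(string, ends, labels)
        return network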
The fact that the training set could be learned very well and the test set could not shows that the total number of files in the training set is still small. We proceeded to increase the size of the training set by segmenting digits from three-digit strings. Adding segments from three-digit strings yielded a total of 2600 training words. The network could be trained to recognize this training set with an accuracy of 99.6% [(93.5%); row i, Table 1] and to recognize the test data (the same data set as in the previous case) with an accuracy of 97.5% [(84.9%); row i, Table 1]. But while the addition of new words to the training data increases the recognition accuracy on a test set, the continued poorer performance on the test set compared to the training set shows that there is still inadequate training data. An experiment was tried in which the isolated digits were added to the segmented connected-digits training data. The recognition score on the training set was reduced to 98% [(83.3%); row j, Table 1], and the score on the test set was reduced to 95.5% [(75.4%); row j, Table 1]. The isolated digits typically have much longer duration than digits segmented from strings. The resultant additional variance in length is probably the cause of the reduced recognition accuracy.

4.8 Recognition of Strings. The experiments described above were done on isolated digits or digits segmented out of strings. We tested the performance of the network mentioned in row i of Table 1 on the unsegmented connected digit strings from which the segmented test set had previously been produced. Some readjustment of delay-line parameters and integration time constants was necessary to eliminate the inhibitory signals from previous digits preventing recognition of the current digit. Such a network was able to recognize the training data with a threshold-criterion accuracy of (92.4%) (row k, Table 1) and the test data with a threshold accuracy of (82.4%) (row k, Table 1). We did not write the more complex software to do scoring for continuous speech by the area criterion, since this is not the desired ultimate use of the network. But by analogy with other experiments (cf. row i, Table 1) we would anticipate a recognition accuracy by the area criterion of approximately 99% on the training data set and 97% on the test data set. More conventional approaches to this problem, involving extensive high-speed digital processing and a combinatorial examination of locations for word boundaries, have been carried out on this data base by many others. Using a network based on acoustic-phonetic features, Bush and Kopec (1987) achieved an accuracy of 96%. Rabiner et al. (1988) achieved an accuracy of 97.1% using a hidden Markov model. Our network can easily be implemented using low-precision and low-power analog circuitry. The connections would require only two bits of precision plus an algebraic sign, and the network has been shown to tolerate a considerable amount of noise (Unnikrishnan et al. 1991). While the experiments are not strictly comparable (the earlier work is on string recognition, and we have not made a complete study of all strings of all
nonchildren), the difference between them is comparable to that expected between the threshold and area criteria within our studies. This indicates that the two approaches extract similar amounts of information from the sound signal (though not necessarily the same information), and that the major addition of the HMM procedure is the ability to work somewhat better with words of great length variability, through massive computation. The direct neural network approaches to this problem are to use multiple or hierarchical time-scale networks (see also Waibel et al. 1989).
5 Conclusions

We believe that time-delay networks of the style we have studied are likely to be able to solve the speaker-independent digit recognition problem at a useful engineering level, with a neural network small enough to fit onto a single very-low-power analog VLSI chip. Even if four sets of connections (two sets of time delays and two sets of voice qualities) are needed, the total number of connections required is less than 6000. The four networks would share a common front end and delay network. The Intel 80170NW electrically trainable analog neural network chip, based on EEPROM technology, already has 10,000 adjustable analog connections. The experiments we have described permit us to delineate the remaining problems and possible ways to solve them. First, the front end needs some improvement. The very large increase in performance produced by the inclusion of a voicing detector is an indication that the substitution of one or two frequency filters by more clever feature detectors would be of enormous help. Even the frequency filters themselves are not optimal. The output of the filter bank often lacks a formant trajectory when that trajectory is clearly visible in a windowed FFT power spectrum. The variability of our front-end output compared to that of the WAVES program (Entropic Speech Inc.) suggests that better filters alone would be of considerable help. Second, the amount of available data in the TI data set is inadequate for the learning procedure used in the present study. It is, for example, responsible for the large difference in recognition accuracy between the test and training sets illustrated in row k of Table 1. A modified learning procedure that can capture outliers and generalizes better could be adopted, or, alternatively, a brute-force approach of using a larger data set. For example, a variance model could be used in conjunction with the training set to effectively enlarge it. Third, speaker clusters should be produced by the networks directly. In the experiments described here, training data were clustered using males and nonmales as predefined categories. Clustering by the networks themselves would make each cluster more compact and simplify the problem.
Fourth, when connected speech and isolated words are combined in a single data base, the difference in the duration of a given word within the data begins to matter. This problem can be circumvented by dividing the data into fast and slow categories by clustering, as illustrated in the text, and training a network for each cluster. These networks could be run in parallel, since false recognitions are generally not a problem. The output of the best network can then be used for recognition. The alternative use of a hierarchy of two delay time scales is also attractive.

Acknowledgments

The TI connected digit data base was provided by the National Bureau of Standards. We wish to thank David Talkin for providing us the WAVES program, and the Speech Research Department at Bell Labs for computer support. The work of J. J. H. at Caltech was supported in part by the Office of Naval Research (Contract No. N00014-87-K-0377).

References

Bush, M. A., and Kopec, G. E. 1987. Network-based connected digit recognition. IEEE Trans. ASSP 35, 1401-1413.
Hopfield, J. J. 1987. Learning algorithms and probability distributions in feed-forward and feed-back networks. Proc. Natl. Acad. Sci. U.S.A. 84, 8429-8433.
Leonard, G. E. 1984. A database for speaker-independent digit recognition. Proc. Intl. Conf. Acoustics Speech Signal Process. 3, 42.11.1-42.11.4.
Levinson, S. E., Rabiner, L. R., and Sondhi, M. M. 1983. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Tech. J. 62, 1035-1074.
Rabiner, L. R., Wilpon, J. G., and Soong, F. K. 1988. High performance connected digit recognition using hidden Markov models. Proc. Intl. Conf. Acoustics Speech Signal Process. S3.6, 119-122.
Tank, D. W., and Hopfield, J. J. 1987. Concentrating information in time: Analog neural networks with applications to speech recognition problems. Proc. IEEE First Intl. Conf. Neural Networks, San Diego, CA.
Unnikrishnan, K. P., Hopfield, J. J., and Tank, D. W. 1988. Learning time-delayed connections in a speech recognition circuit. Abstr. Neural Networks Comput. Conf., Snowbird, UT.
Unnikrishnan, K. P., Hopfield, J. J., and Tank, D. W. 1991. Connected-digit speaker-dependent speech recognition using a neural network with time-delayed connections. IEEE Transact. Signal Proc. 39, 698-713.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. ASSP 37, 328-339.

Received 11 February 1991; accepted 15 July 1991.
Communicated by David Zipser
Local Feedback Multilayered Networks

Paolo Frasconi, Marco Gori, Giovanni Soda
Dipartimento di Sistemi e Informatica, Via di S. Marta 3, 50139 Firenze, Italy

In this paper, we investigate the capabilities of local feedback multilayered networks, a particular class of recurrent networks in which feedback connections are only allowed from neurons to themselves. In this class, learning can be accomplished by an algorithm that is local in both space and time. We describe the limits and properties of these networks and give some insights on their use for solving practical problems.

1 Introduction
Recurrent networks have recently been investigated by several researchers (Bourlard 1989; Cleeremans et al. 1989; Elman 1988; Gherrity 1989; Gori et al. 1989; Pearlmutter 1989; Pineda 1987; Watrous 1988; Williams and Zipser 1989) because of these networks' potential capability of coping with sequential tasks. The problem of learning has received particular attention. A review of this subject has recently been proposed by Pearlmutter (1990), who presents the basic contributions by distinguishing algorithms for fully recurrent networks from those used for local feedback networks. In spite of the efforts made to discover learning algorithms, insufficient attention has been paid to finding out which networks are suited for which problems. Such an investigation should provide design criteria for selecting architectures that are well tailored for solving different classes of problems. Instead of learning with huge fully recurrent architectures, it is better to get rid of unnecessary connections, because the network's generalization to new examples is then likely to increase significantly. This paper is a first attempt to gain some knowledge concerning the behavior of particular recurrent networks, referred to as local feedback multilayered neural networks (Gori 1990). Only local feedback connections (self-loops) are assumed, and, therefore, the resulting architecture is very similar to Mozer's focused backpropagation (Mozer 1988). These networks were mainly conceived for phoneme recognition problems in an attempt to capture the dynamic nature of speech (Gori et al. 1989).

Neural Computation 4, 120-130 (1992)
@ 1992 Massachusetts Institute of Technology
The dramatic constraint on feedback connections makes learning possible by means of an algorithm that is local in both space and time, and that has the same complexity as backpropagation (BP). It is proven that these networks are particularly suited for exhibiting a forgetting behavior, which is a very desirable feature for a task like phoneme recognition. They can latch events that occur arbitrarily far in time, but they are very poor at dealing with arbitrarily long sequences. For example, a problem as easy as implementing a counter cannot be learned by a local feedback MLN.

2 Local Feedback MLNs and BPS¹
The network class we consider relies on an MLN architecture in which some neurons exhibit dynamic behavior. The following definition sketches the hypotheses we assume.

Definition 1 (Local Feedback MLNs).

1. The network has an MLN architecture. Static and dynamic neurons are considered. For convenience, we group neurons in the following sets: $\mathcal{I}$ = input neuron set, $|\mathcal{I}| = l$; $\mathcal{H}$ = hidden neuron set, $|\mathcal{H}| = m$; $\mathcal{O}$ = output neuron set, $|\mathcal{O}| = n$; $\mathcal{D}$ = dynamic neuron set, $\mathcal{D} \subseteq \mathcal{H} \cup \mathcal{O}$; where $|\cdot|$ denotes cardinality.

2. Static neurons perform a weighted sum of their inputs:
$$a_i(t) = \sum_j w_{ij} x_j(t) \tag{2.1}$$

3. The activation of the dynamic neurons follows one of the equations
$$a_i(t) = w_{ii}\, a_i(t-1) + \sum_{j \neq i} w_{ij} x_j(t) \quad \text{(local activation feedback MLN)} \tag{2.2}$$
$$a_i(t) = w_{ii}\, x_i(t-1) + \sum_{j \neq i} w_{ij} x_j(t) \quad \text{(local output feedback MLN)} \tag{2.3}$$
For dynamic neurons $i \in \mathcal{D}$, only connections from the network inputs can be accepted ($j \in \mathcal{I}$). $\quad$ (2.4)

¹BPS was first derived in a slightly different form in Gori et al. 1989.
[Figure 1: An example of local feedback MLN with dynamic neurons in the first hidden layer; the figure shows an input layer, a dynamic layer with self-loop connections, and an output layer.]
4. Each neuron has a sigmoidal output function $f(a) \doteq \tanh(a/2)$.

5. The learning environment is composed of a sequence of frames, and the supervision can be arbitrarily placed along the sequence. The error $C$ with respect to the targets is evaluated according to
$$C = \frac{1}{2} \sum_{t \in \mathcal{T}} \sum_{i \in \mathcal{O}} \beta_i(t) \left[ d_i(t) - x_i(t) \right]^2 \tag{2.5}$$
where $\mathcal{T}$ is the set of frame indexes, $d_i(t)$ represents the desired output for unit $i \in \mathcal{O}$ when feeding the network with frame $t$, and $\beta_i(t)$ is a binary switch that defines at which times $t \in \mathcal{T}$ supervision takes place on the output neurons.

Figure 1 shows a typical example of local feedback MLN, in which all dynamic neurons are placed in the hidden layer (i.e., $\mathcal{D} = \mathcal{H}$). Such an arrangement of dynamic neurons has been successfully employed for phoneme recognition tasks (Gori et al. 1989). Local feedback MLNs can be derived from continuous models similar to that of Pearlmutter. A detailed analysis of the discrete approximation of continuous recurrent nets, which enlightens the role of the model's parameters, can be found in Tsung (1990). It is worth mentioning that in our model, the term $w_{ii}$ of equations 2.2 and 2.3 provides the unit with a time constant that is responsible for its dynamic behavior.

Learning is accomplished by means of a gradient descent algorithm. The previous hypotheses, particularly those concerning the architecture, make it possible to compute the gradient by means of an algorithm,
called backpropagation for sequences (BPS) (Gori et al. 1989), which is local in both space and time. This represents a significant advantage in using this kind of architecture. A similar algorithm was independently discovered by Mozer (1988) for his focused architecture.
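To make the model of Definition 1 concrete, the following sketch simulates the forward pass of a local feedback MLN with one dynamic hidden layer, implementing equations 2.2 and 2.3. The layer sizes, random weights, and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def f(a):
    # Sigmoidal output function of Definition 1: f(a) = tanh(a/2)
    return np.tanh(a / 2.0)

# Illustrative sizes: l inputs, m dynamic hidden units, n outputs
l, m, n = 4, 3, 2
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.5, size=(m, l))   # input -> dynamic hidden weights w_ij
w_self = rng.normal(scale=0.5, size=m)      # self-loop weights w_ii
W_out = rng.normal(scale=0.5, size=(n, m))  # hidden -> output weights (static)

def run(X, mode="activation"):
    """Forward pass over a sequence X of shape (T, l).
    mode="activation": a_i(t) = w_ii a_i(t-1) + sum_j w_ij x_j(t)   (eq. 2.2)
    mode="output":     a_i(t) = w_ii x_i(t-1) + sum_j w_ij x_j(t)   (eq. 2.3)"""
    a_prev = np.zeros(m)   # previous activations of the dynamic units
    x_prev = np.zeros(m)   # previous outputs of the dynamic units
    ys = []
    for x_t in X:
        drive = W_in @ x_t
        a = w_self * (a_prev if mode == "activation" else x_prev) + drive
        x_hid = f(a)
        ys.append(f(W_out @ x_hid))
        a_prev, x_prev = a, x_hid
    return np.array(ys)

print(run(rng.normal(size=(10, l)), mode="output").shape)  # (10, 2)
```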
Theorem 1 (On Learning in Local Feedback MLNs). In local feedback MLNs the following facts are true:
The gradient of cost 2.5 can be computed by
$$\frac{\partial C}{\partial w_{ij}} = \sum_{t \in \mathcal{T}} y_i(t)\, z_{ij}(t) \tag{2.6}$$
where $y_i(t)$ needs to be computed only if supervision takes place at time $t$. The backward factor $y_i(t)$ can be computed by using the ordinary backpropagation backward step. If $w_{ij}$ is a static weight, then the forward factor $z_{ij}(t)$ can be computed in exactly the same way as in BP. For all weights connected to dynamic neurons, one of the following relationships holds:
$$z_{ij}(t) = w_{ii}\, f'[a_i(t-1)]\, z_{ij}(t-1) + x_j(t)(1 - \delta_{ij}) + \delta_{ij}\, x_i(t-1) \quad \text{(output feedback)} \tag{2.7}$$
$$z_{ij}(t) = w_{ii}\, z_{ij}(t-1) + x_j(t)(1 - \delta_{ij}) + \delta_{ij}\, a_i(t-1) \quad \text{(activation feedback)} \tag{2.8}$$
with
$$z_{ij}(0) = 0, \qquad \delta_{ij} = 1 \text{ iff } i = j, \qquad i \in \mathcal{D},\; j \in \mathcal{I}. \tag{2.9}$$
Proof. First, let us consider static weights. The architectural hypotheses imply that static neurons cannot feed dynamic ones, and thus the path from each input to each output is entirely static, exactly as in BP. As a result, the forward term $z_{ij}(t)$ simply reduces to $x_j(t)$. Second, for dynamic weights, 2.6 holds because of hypothesis 2.4. When a dynamic weight changes, the activation of the corresponding neuron also changes according to the forward factor of 2.6, whereas the change in the cost is related to the path connecting this neuron to the outputs; this path, according to 2.4, is a static one. It follows that $y_i(t)$ can be computed by using BP's ordinary backward step. Finally, by taking the weight derivatives of 2.2 and 2.3, we obtain 2.7 and 2.8. □

It is worth mentioning that a learning procedure based on similar results can also be derived by assuming that the forward connections carry multiple delays (Gori 1989).
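As an informal illustration of Theorem 1, the sketch below iterates the forward factors $z_{ij}(t)$ of equations 2.7 and 2.9 for a single dynamic unit with local output feedback; the variable names and the single-unit restriction are our own assumptions. The gradient 2.6 would then be obtained by combining these factors with the backward factors $y_i(t)$ from the ordinary BP backward step.

```python
import numpy as np

def f(a):  return np.tanh(a / 2.0)
def fp(a): return 0.5 * (1.0 - f(a) ** 2)   # f'(a)

def bps_forward(x_seq, w_in, w_self):
    """One dynamic unit with local output feedback (eq. 2.3):
    a(t) = w_self * x(t-1) + sum_j w_in[j] * x_seq[t, j].
    Returns activations a and forward factors (eq. 2.7):
    z[t, j] = d a(t)/d w_in[j]; z[t, -1] is the self-loop factor."""
    T, l = x_seq.shape
    a = np.zeros(T)
    z = np.zeros((T, l + 1))                 # z_ij(0) = 0, eq. 2.9
    x_prev, a_prev, z_prev = 0.0, 0.0, np.zeros(l + 1)
    for t in range(T):
        a[t] = w_self * x_prev + w_in @ x_seq[t]
        decay = w_self * fp(a_prev)          # w_ii f'[a_i(t-1)] term
        z[t, :l] = decay * z_prev[:l] + x_seq[t]   # j != i (delta_ij = 0)
        z[t, -1] = decay * z_prev[-1] + x_prev     # j = i  (delta_ij = 1)
        x_prev, a_prev, z_prev = f(a[t]), a[t], z[t]
    return a, z

a, z = bps_forward(np.random.default_rng(0).normal(size=(20, 3)),
                   np.array([0.3, -0.2, 0.5]), 0.8)
print(z[-1])  # forward factors at the last time step
```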
3 Forgetting Behavior
Definition 2 (Forgetting Behavior). We say a recurrent network exhibits forgetting behavior if the following relationship holds:
$$\lim_{t \to \infty} \frac{\partial x_i(t)}{\partial x_j(t_0)} = 0 \qquad \forall i \in \mathcal{O},\; \forall j \in \mathcal{I} \tag{3.1}$$
When adopting local activation feedback MLNs, quite a simple result is that forgetting is strictly related to the stability of each neuron (Gori 1990). Moreover, this architecture turns out to have a canonical form, because it is equivalent to all possible networks having full connection in a given context layer (Frasconi et al. 1990). If output feedback is assumed, then a similar conclusion can be drawn concerning forgetting.

Theorem 2 (Forgetting Behavior). Local output feedback MLNs exhibit forgetting behavior, provided that
$$|w_{ii}| < 1/d \qquad \forall i \in \mathcal{D}, \quad \text{where } d = \max_a |f'(a)| = 1/2. \tag{3.2}$$
Proof.² Let us define
$$\psi_{ij}(t) \doteq \frac{\partial x_i(t)}{\partial x_j(t_0)}, \qquad \varphi_{kj}(t) \doteq \frac{\partial x_k(t)}{\partial x_j(t_0)}, \qquad i \in \mathcal{O},\; k \in \mathcal{H},\; j \in \mathcal{I} \tag{3.3}$$
By using the chain rule, it is quite simple to prove the following:
$$\psi_{ij}(t) = \sum_{k \in \mathcal{H}} \frac{\partial x_i(t)}{\partial x_k(t)}\, \varphi_{kj}(t) \tag{3.4}$$
Now let us define the static contribution
$$h_k(t) \doteq \sum_{j \in \mathcal{I}} w_{kj}\, x_j(t) \tag{3.5}$$
for each neuron $k \in \mathcal{H}$. By using the chain rule again, we can compute $\varphi_{kj}(t)$ for each dynamic neuron as follows:
$$\varphi_{kj}(t) = f'[w_{kk}\, x_k(t-1) + h_k(t)]\, w_{kk}\, \varphi_{kj}(t-1) \tag{3.6}$$
Since $f'(\cdot) \in (0, d]$, it can be proved by induction on $q = t - t_0$ that the following inequality holds:
$$|\varphi_{kj}(t)| \le (|w_{kk}|\, d)^{\,t - t_0}\, |\varphi_{kj}(t_0)| \tag{3.7}$$
and thus, since $|w_{kk}|\, d < 1$,
$$\lim_{t \to \infty} \varphi_{kj}(t) = 0 \tag{3.8}$$
and, by using 3.4,
$$\lim_{t \to \infty} \psi_{ij}(t) = 0 \tag{3.9}$$
□

²For the sake of simplicity, the proof is limited to networks having just one hidden layer, but it can be easily extended to general local feedback MLNs.
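A quick numerical check of Theorem 2 can be obtained by iterating the recursion 3.6 alongside the unit; the random static contributions, the particular self-loop weights, and the horizon are arbitrary choices made for illustration.

```python
import numpy as np

def f(a): return np.tanh(a / 2.0)

def sensitivity(w_self, T=60, seed=1):
    # Random static contributions h(t) (eq. 3.5) stand in for the inputs
    rng = np.random.default_rng(seed)
    h = rng.normal(size=T)
    x, phi = 0.0, 1.0          # phi tracks the derivative of eq. 3.6
    for t in range(T):
        a = w_self * x + h[t]
        phi *= w_self * 0.5 * (1.0 - f(a) ** 2)   # f'(a) lies in (0, 1/2]
        x = f(a)
    return abs(phi)

print(sensitivity(1.5))   # |w| < 2 = 1/d: sensitivity decays to ~0 (forgetting)
print(sensitivity(4.0))   # |w| > 2: the decay is no longer guaranteed
```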
These results suggest that the class of networks we consider is very well suited for all applications in which the network output has to take relatively recent information into account. Successful results have been obtained by using this model for phoneme recognition, in which a forgetting behavior turns out to be very useful (Gori et al. 1989).

4 Learning Sequences
There are several situations in which we want to recognize patterns composed of sequences. Tasks of this kind occur, for example, in automatic speech recognition for recognizing words. Let us distinguish two cases, depending on whether or not the maximum sequence length is fixed in advance. Moreover, let us assume that the sequences are supervised only at the end; in other words, the class is established only when the whole sequence has been scanned by the network.

4.1 Fixed Length Sequences. Let us assume a local activation feedback architecture with two hidden layers. The first one is composed of dynamic neurons that extract a distinctive feature from each sequence. The second one is an ordinary static layer that is necessary to perform universal interpolation between the previous feature vector and the output. The learning environment is composed of sequences of fixed length (maximum length = $T$). It can be proved (Frasconi et al. 1991) that these sequences can be learned with null cost by a network having at least $l \cdot T$ dynamic neurons in the first hidden layer and enough static neurons in the second hidden layer. We can observe that the required number of dynamic hidden neurons is the same as the number of inputs that would be necessary to model the sequence as a static pattern. This represents a dramatic practical limitation that cuts down the advantages with respect to feedforward architectures. However, since local feedback networks are likely to achieve better generalization in these cases, they may still be more advantageous.
4.2 Information Latching. The limitation on the maximum sequence length may sometimes represent a serious restriction. A sequence can be arbitrarily long, thus requiring a discrimination rule that depends on arbitrarily remote information. In these cases, local activation feedback MLNs
are sure to fail because they relax to a unique equilibrium state. Moreover, even with finite-length sequences, it may turn out to be better to use other recurrent architectures. The result outlined in the previous paragraph also suggests investigating the effect of adopting output instead of activation feedback: the limitation just described derives from the network's linear dynamic behavior, and it is interesting to see whether further improvement can be attained by exploiting the nonlinear dynamic behavior typical of output feedback.

Definition 3 (Information Latching). We say that a given dynamic hidden neuron latches information if its free evolution (null inputs) satisfies the following relationship:
$$\operatorname{sign}[a(t)] = \operatorname{sign}[a(t_0)] \qquad \forall t > t_0 \tag{4.1}$$
This definition suggests the interpretation of the neuron output as a boolean status, in that only the sign of $x_i(t)$ is relevant and not its absolute value.
Theorem 3. Given a generic dynamic hidden neuron, information latching (for null inputs) occurs provided that $|w| > 2$.³ The latching condition also holds if the forcing term of 2.3 is bounded in modulus by a constant $B = w f(\xi) - \xi$, where $\xi$ satisfies
$$\frac{w}{2}\left[1 - f^2(\xi)\right] - 1 = 0 \tag{4.2}$$
Proof. Let us consider the free evolution of a generic dynamic hidden neuron
$$a(t+1) = w f[a(t)] \doteq g[a(t)] \tag{4.3}$$
If we assume $|w| > 2$, equation 4.3 has two stable equilibrium points. We now prove the asymptotic stability of the positive solution $a_s^+$ that satisfies the relationships
$$a_s^+ = w f(a_s^+), \qquad a_s^+ > 0 \tag{4.4}$$
Let us define the Lyapunov function $V(a)$ as
$$V(a) = (a - a_s^+)^2 \tag{4.5}$$
We have
$$\Delta V = V[g(a)] - V(a) = [g(a) - a_s^+]^2 - (a - a_s^+)^2 = [g(a) + a - 2a_s^+]\,[g(a) - a] \tag{4.6}$$
If $a \ge g(a)$ then $a \ge a_s^+$ and $g(a) \ge a_s^+$. As a result the first factor is positive and the second is negative, and consequently
$$\Delta V \le 0 \tag{4.7}$$
Conversely, if $0 < a \le g(a)$ then $a \le a_s^+$ and $g(a) \le a_s^+$, and therefore the first factor is negative and the second positive, so we have $\Delta V \le 0$ again. The stability of $a_s^+$ for each $a \in (0, \infty)$ follows. A similar proof can be provided for the stability of the other point $a_s^-$.

It can be easily proved that the inequality 4.7 is also valid if $g(a)$ is translated by a constant $b$ such that $|b| < B$, where $B$ is defined by equation 4.2. Now let us consider the effect of adding a time-variant forcing term $b(t)$ bounded in modulus by a constant $b_0$ such that $0 < b_0 \le B$. As previously done, let us limit the analysis to the positive solution. From the previous discussion, it follows that the system
$$\alpha(t+1) = w f[\alpha(t)] - b_0 \tag{4.8}$$
has a stable equilibrium point $\alpha_s^+$. We can easily prove that the activation $a(t)$ of the system
$$a(t+1) = w f[a(t)] + b(t) \tag{4.9}$$
satisfies the inequality
$$a(t) \ge \alpha(t) \tag{4.10}$$
Assuming null initial state, equation 4.10 is obviously valid for $t = 0$. Let us suppose that 4.10 is valid at $t$; then
$$a(t+1) = w f[a(t)] + b(t) \ge w f[\alpha(t)] + b(t) = \alpha(t+1) + b_0 + b(t) \ge \alpha(t+1) \tag{4.11}$$
Because of the previous considerations on the stability of $\alpha_s^+$ (see equation 4.8), the activation $\alpha(t)$, and therefore $a(t)$, cannot change sign, and, by definition 3, information latching occurs. □

³$w$ is the weight of the local feedback connection of the neuron.
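The latching mechanism of Theorem 3 is easy to visualize numerically; in the sketch below, the value $w = 4$ and the initial states are arbitrary choices made for illustration.

```python
import numpy as np

def f(a): return np.tanh(a / 2.0)

def free_evolution(w, a0, steps=50):
    # Free evolution a(t+1) = w f[a(t)] = g[a(t)], eq. 4.3
    a = a0
    for _ in range(steps):
        a = w * f(a)
    return a

# With w = 4 > 2, the nonzero fixed points of a = 4 tanh(a/2) lie near
# +/- 3.83; any nonzero initial state keeps its sign, i.e., the boolean
# status sign[a(t)] of Definition 3 is latched.
for a0 in (0.2, 3.0, -0.2, -3.0):
    print(a0, "->", round(free_evolution(4.0, a0), 4))
```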
Example 1. Let us consider the problem of recognizing arbitrarily long strings of characters. In particular, let us assume that the alphabet is composed of five symbols (A, B, C, D, E), and that the class is decided just by taking the first string symbol. We want to see if a local feedback MLN is able to learn this rule. We adopted a network with five inputs using exclusive symbol coding, three dynamic hidden neurons with output feedback, and five outputs, again with exclusive symbol coding. The learning environment was composed of 500 sequences (100 for each class). The network learned the task quickly by discovering a coding of the five classes by means of the three dynamic hidden neurons. In particular, for each sequence, the network was able to discover automatically the only significant symbol (the first one) and to latch all the related information until the end of the sequence. All patterns of the learning environment were recognized.
4.3 General Sequences. At this point, we must address an important question related to the behavior of a local output feedback MLN in tasks such as the classification of arbitrary sequences. Information latching is an indispensable property for dealing with this class of problems, but it is not sufficient. The real weakness of local feedback architectures concerns problems that require keeping track of the order of symbols within a sequence. For the sake of simplicity, we restrict our discussion to a particular class of sequences, for which the problem of recognition is very similar to the problem of parsing a string of symbols.
Definition 4 (Event Sequences). We say that a sequence of symbols⁴ $\{S(\tau),\ \tau \in \mathcal{T},\ S(\tau) \in A\}$ ($A$ is a given finite alphabet) is an event sequence (ES) if the following conditions hold:

1. $A$ can be partitioned into two sets $R$ (relevant set) and $N$ (nonrelevant set), according to a given criterion; symbols belonging to the relevant set are referred to as events.
2. An arbitrary number of symbols belonging to $N$ can be interposed between two consecutive symbols belonging to $R$.

In the following, if a given dynamic hidden neuron performs information latching, we assume that symbols of $R$ can, and symbols of $N$ cannot, modify the boolean status of the neuron. This can be accomplished in local feedback architectures under some simple hypotheses on the "separability" of symbols. The previous definition refers to network configurations in which the dynamic hidden neurons perform information latching; on the other hand, it is clear that, without this assumption, local feedback MLNs cannot deal with sequences of any length. When using ESs, the nonrelevant symbols do not affect the net's evolution. Conversely, the relevant symbols drive the status of each dynamic hidden neuron. Local feedback MLNs are not suited for dealing with ESs, as shown in the following simple example.
Example 2. Let us consider the problem of counting the number of occurrences of a given event $e$ in an ES for which $R = \{e\}$. It is quite simple to show that this problem cannot be solved by any local feedback MLN. Because of condition 2 of the ES definition, we begin by pointing out that a network that does not perform information latching obviously cannot solve a problem like this. Let us assume then that one or more dynamic hidden neurons latch event $e$. The first occurrence of the event changes the boolean state of these neurons. However, whenever the event $e$ comes up again, this state cannot be changed. As a result, our net cannot deal with multiple occurrences of the same event.

⁴To feed the networks, these symbols must be represented as vectors of $\mathbb{R}^l$.
The proposed example is not an isolated case in which local feedback MLNs fail. It can be shown that they also fail in situations in which the ordering of the sequence must be taken into account.
5 Conclusions

The analyses reported in this paper indicate that local feedback MLNs are particularly suited for problems in which classification depends on relatively recent information, because of their capability of exhibiting a forgetting effect. Our networks also behave quite well in problems of sequence recognition, provided that the sequences involved are relatively short. When dealing with long sequences, and particularly with sequences whose length cannot be bounded in advance, local feedback MLNs exhibit serious limitations. A problem as easy as counting the occurrences of a given event cannot be solved for sequences of arbitrary length. For all these situations, different recurrent architectures must be used. In these cases, learning algorithms like the ones suggested by Williams and Zipser (1989) and Pearlmutter (1989) must be adopted; although they are not local in both space and time, they allow us to learn in fully recurrent networks.
Acknowledgments

This research was partially supported by MURST 40% and CNR Grant 90.01530.CT01. We thank Renato De Mori and Yoshua Bengio of the School of Computer Science, McGill University, Montreal, Canada, for their contribution to some of the ideas reported in this paper.
References

Bourlard, H., and Wellekens, C. 1990. Links between hidden Markov models and multilayer perceptrons. IEEE Transact. Pattern Anal. Machine Intelligence PAMI-12, 1167-1178.
Cleeremans, A., Servan-Schreiber, D., and McClelland, J. L. 1989. Finite state automata and simple recurrent networks. Neural Comp. 1, 372-381.
Elman, J. L. 1988. Finding structure in time. CRL Tech. Rep. 8801. La Jolla: University of California, San Diego, Center for Research in Language.
Frasconi, P., Gori, M., and Soda, G. 1990. Recurrent networks with activation feedback. Proc. 3rd Italian Workshop Parallel Architectures Neural Networks, Vietri sul Mare, Salerno, 15-18 May 1990, pp. 329-336.
Frasconi, P., Gori, M., and Soda, G. 1991. Local feedback multilayered networks. Tech. Rep. RT2/91, Dipartimento di Sistemi e Informatica, Università di Firenze.
Gherrity, M. 1989. A learning algorithm for analog, fully recurrent neural networks. Proc. IEEE-IJCNN89 I, 643-644, Washington DC, June 18-22, 1989.
Gori, M. 1989. An extension of BPS. Proc. Neuro-Nimes '89, 83-93, Nimes, France, 13-16 November 1989.
Gori, M. 1990. Apprendimento con supervisione in reti neuronali. Ph.D. Thesis, Università di Bologna, Italy.
Gori, M., Bengio, Y., and De Mori, R. 1989. BPS: A learning algorithm for capturing the dynamical nature of speech. Proc. IEEE-IJCNN89 II, 417-423, Washington DC, June 18-22, 1989.
Mozer, M. C. 1988. A focused back-propagation algorithm for temporal pattern recognition. Tech. Rep. CRG-TR-88-3, University of Toronto.
Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1, 263-269.
Pearlmutter, B. A. 1990. Two new learning procedures for recurrent networks. Neural Networks Rev. 3, 99-101.
Pineda, F. J. 1987. Generalization of back-propagation to recurrent neural networks. Phys. Rev. Lett. 59, 2229-2232.
Tsung, F. S. 1990. Learning in recurrent finite difference networks. In Proceedings of the 1990 Connectionist Models Summer School, D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, eds., pp. 123-130. Morgan Kaufmann, San Mateo, CA.
Watrous, R. L., Ladendorf, B., and Kuhn, G. 1988. Complete gradient optimization of a recurrent network applied to /b/, /d/, /g/ discrimination. J. Acoust. Soc. Am. 87, 1301-1309.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.
Received 4 March 1991; accepted 25 April 1991.
Communicated by Fernando Pineda
Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks

Jürgen Schmidhuber*
Institut für Informatik, Technische Universität München, Arcisstr. 21, 8000 München 2, Germany

*Current address: Department of Computer Science, University of Colorado, Campus Box 430, Boulder, CO 80309 USA.

Neural Computation 4, 131-139 (1992)
© 1992 Massachusetts Institute of Technology
Previous algorithms for supervised sequence learning are based on dynamic recurrent networks. This paper describes an alternative class of gradient-based systems consisting of two feedforward nets that learn to deal with temporal sequences using fast weights: The first net learns to produce context-dependent weight changes for the second net, whose weights may vary very quickly. The method offers the potential for STM storage efficiency: A single weight (instead of a full-fledged unit) may be sufficient for storing temporal information. Various learning methods are derived. Two experiments with unknown time delays illustrate the approach. One experiment shows how the system can be used for adaptive temporary variable binding.

1 The Task

A training sequence $p$ with $n_p$ discrete time steps (called an episode) consists of $n_p$ ordered pairs $[x^p(t), d^p(t)] \in \mathbb{R}^n \times \mathbb{R}^m$, $0 < t \le n_p$. At time $t$ of episode $p$ a learning system receives $x^p(t)$ as an input and produces the output $y^p(t)$. The goal of the learning system is to minimize
$$E^{total} = \sum_p \sum_{t=1}^{n_p} \sum_{i=1}^{m} \left[ d_i^p(t) - y_i^p(t) \right]^2$$
where $d_i^p(t)$ is the $i$th of the $m$ components of $d^p(t)$, and $y_i^p(t)$ is the $i$th of the $m$ components of $y^p(t)$. In general, this task requires storage of input events in a short-term memory. Previous solutions to this problem have employed gradient-based dynamic recurrent nets (e.g., Robinson and Fallside 1987; Pearlmutter 1989; Williams and Zipser 1989). In the next section an alternative gradient-based approach is described. For convenience, we drop the indices $p$ that stand for the various episodes. The gradient of the error over all episodes is equal to the sum of the gradients for each episode. Thus we only require a method for minimizing the error observed during one particular episode:
$$E = \sum_t E(t), \qquad \text{where} \quad E(t) = \sum_i \left[ d_i(t) - y_i(t) \right]^2$$
[In the practical on-line version of the algorithm below there will be no episode boundaries; one episode will "blend" into the next (Williams and Zipser 1989).]

2 The Architecture and the Algorithm
The basic idea is to use a slowly learning feedforward network S (with a set of randomly initialized weights $W_S$) whose input at time $t$ is the vector $x(t)$ and whose output is transformed into immediate weight changes for a second "fast-weight" network F. The input to F at time $t$ is also $x(t)$, its $m$-dimensional output is $y(t)$, and its set of weight variables is $W_F$. F serves as a short-term memory: At different time steps, the same input event may be processed in different ways, depending on the time-varying state of $W_F$.

The standard method for processing temporal sequences is to employ a recurrent net with feedback connections. The feedback connections allow for a short-term memory of information from earlier in a sequence. The present work suggests a novel approach to building a short-term memory by employing fast weights that can be set and reset by the "memory controller" S. Fast weights can hold on to information over time because they remain essentially invariant unless they are explicitly modified. One potential advantage of the method over the more conventional recurrent net algorithms is that it does not necessarily require full-fledged units (experiencing some sort of feedback) for storing temporal information. A single weight may be sufficient. Because there are many more weights than units in most networks, this property represents a potential for storage efficiency. For related reasons, the novel representation of past inputs is well suited for solving certain problems involving temporary variable binding in a natural manner: F's current input may be viewed as a representation of the addresses of a set of variables; F's current output may be viewed as the representation of the current contents of this set of variables. In contrast with recurrent nets, temporary bindings can be established very naturally by temporary connectivity patterns instead of temporary activation patterns (see Section 3.2 for an illustrative experiment).

For initialization reasons we introduce an additional time step 0 at the beginning of an episode. At time step 0, each weight variable $w_{ab} \in W_F$ of a directed connection from unit $a$ to unit $b$ is set to $\diamond w_{ab}(0)$ (a function of S's outputs, as described below). At time step $t > 0$, the $w_{ab}(t-1)$ are used to compute the output of F according to the usual activation spreading rules for backpropagation networks (e.g., Werbos 1974).
After this, each weight variable $w_{ab} \in W_F$ is altered according to
$$w_{ab}(t) = \sigma[w_{ab}(t-1),\ \diamond w_{ab}(t)] \tag{2.1}$$
where $\sigma$ (e.g., a sum-and-squash function) is differentiable with respect to all its parameters, and where the activations of S's output units (again computed according to the usual activation spreading rules for backpropagation networks) serve to compute $\diamond w_{ab}(t)$ by a mechanism specified below. $\diamond w_{ab}(t)$ is S's contribution to the modification of $w_{ab}$ at time step $t$. Equation 2.1 is essentially identical to Möller and Thrun's equation 1 in Möller and Thrun (1990). Unlike Möller and Thrun (1990), however, the current paper derives an exact gradient descent algorithm for time-varying inputs and outputs for this kind of architecture.

For all weights $w_{ij} \in W_S$ (from unit $i$ to unit $j$) we are interested in the increment
$$\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}} = -\eta \sum_{t > 0} \sum_{w_{ab} \in W_F} \frac{\partial E(t)}{\partial w_{ab}(t-1)}\, \frac{\partial w_{ab}(t-1)}{\partial w_{ij}} \tag{2.2}$$
Here $\eta$ is a constant learning rate. At each time step $t > 0$, the factor $\partial E(t)/\partial w_{ab}(t-1)$ can be computed by conventional backpropagation (e.g., Werbos 1974). For $t > 0$ we obtain the recursion
$$\frac{\partial w_{ab}(t)}{\partial w_{ij}} = \frac{\partial \sigma[w_{ab}(t-1), \diamond w_{ab}(t)]}{\partial w_{ab}(t-1)}\, \frac{\partial w_{ab}(t-1)}{\partial w_{ij}} + \frac{\partial \sigma[w_{ab}(t-1), \diamond w_{ab}(t)]}{\partial \diamond w_{ab}(t)}\, \frac{\partial \diamond w_{ab}(t)}{\partial w_{ij}}$$
We can employ a method similar to the one described in Robinson and Fallside (1987) and Williams and Zipser (1989): For each $w_{ab} \in W_F$ and each $w_{ij} \in W_S$ we introduce a variable $p_{ij}^{ab}$ (initialized to zero at the beginning of an episode) that can be updated at each time step $t > 0$:
$$p_{ij}^{ab}(t) = \frac{\partial \sigma}{\partial w_{ab}(t-1)}\, p_{ij}^{ab}(t-1) + \frac{\partial \sigma}{\partial \diamond w_{ab}(t)}\, \frac{\partial \diamond w_{ab}(t)}{\partial w_{ij}} \tag{2.3}$$
$\partial \diamond w_{ab}(t)/\partial w_{ij}$ depends on the interface between S and F. With a given interface (two possibilities are given below), an appropriate backpropagation procedure for each $w_{ab} \in W_F$ gives us $\partial \diamond w_{ab}(t)/\partial w_{ij}$ for all $w_{ij} \in W_S$.
After having updated the $p_{ij}^{ab}$ variables, (2.2) can be computed using the formula
$$\Delta w_{ij} = -\eta \sum_{t > 0} \sum_{w_{ab} \in W_F} \frac{\partial E(t)}{\partial w_{ab}(t-1)}\, p_{ij}^{ab}(t-1)$$
A simple interface between S and F would provide one output unit $s_{ab} \in S$ for each weight variable $w_{ab} \in W_F$, where
$$\diamond w_{ab}(t) := s_{ab}(t) \tag{2.4}$$
$s_{ab}(t)$ being the output unit's activation at time $t \ge 0$. A disadvantage of 2.4 is that the number of output units in S grows in proportion to the number of weights in F. An alternative is the following: Provide an output unit in S for each unit in F from which at least one fast weight originates. Call the set of these output units FROM. Provide an output unit in S for each unit in F to which at least one fast weight leads. Call the set of these output units TO. For each weight variable $w_{ab} \in W_F$ we now have a unit $s_a \in$ FROM and a unit $s_b \in$ TO. At time $t$, define $\diamond w_{ab}(t) := g[s_a(t), s_b(t)]$, where $g$ is differentiable with respect to all its parameters. As a representative example we will focus on the special case of $g$ being the multiplication operator:
$$\diamond w_{ab}(t) := s_a(t)\, s_b(t) \tag{2.5}$$
Here the fast weights in F are manipulated by the outputs of S in a Hebb-like manner, assuming that $\sigma$ is just a sum-and-squash function as employed in the experiments described below. One way to interpret the FROM/TO architecture is to view S as a device for creating temporary associations by giving two parameters to the short-term memory: The first parameter is an activation pattern over FROM representing a key to a temporary association pair; the second parameter is an activation pattern over TO representing the corresponding entry. Note that both key and entry may involve hidden units.

Equations 2.4 and 2.5 differ in the way that error signals are obtained at S's output units: If 2.4 is employed, then we use conventional backpropagation to compute $\partial s_{ab}(t)/\partial w_{ij}$ in 2.3. If 2.5 is employed, note that
$$\frac{\partial \diamond w_{ab}(t)}{\partial w_{ij}} = s_a(t)\, \frac{\partial s_b(t)}{\partial w_{ij}} + s_b(t)\, \frac{\partial s_a(t)}{\partial w_{ij}} \tag{2.6}$$
Conventional backpropagation can be used to compute $\partial s_a(t)/\partial w_{ij}$ for each output unit $a$ and for all $w_{ij}$. The results can be kept in $|W_S| \cdot |\mathrm{FROM} \cup \mathrm{TO}|$ variables. This makes it easy to solve 2.6 in a second pass. The algorithm is local in time; its update complexity per time step is $O(|W_F| \cdot |W_S|)$. However, it is not local in space (see Schmidhuber 1990b for a definition of locality in space and time).
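The following minimal sketch illustrates the forward dynamics of the FROM/TO interface (equation 2.5) together with the fast-weight update of equation 2.1; learning is omitted. The network sizes and the particular sum-and-squash function sigma are assumptions made for illustration, not definitions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_to = 3, 1                    # illustrative sizes (flip-flop-like)
# Slow net S maps x(t) to a FROM pattern and a TO pattern (no hidden units)
W_S_from = rng.uniform(-0.1, 0.1, size=(n_in, n_in))
W_S_to   = rng.uniform(-0.1, 0.1, size=(n_to, n_in))
W_F = np.full((n_to, n_in), 0.5)     # fast weights of F

def sigma(w_prev, delta, T=10.0):
    # An assumed sum-and-squash sigma bounding fast weights in (0, 1)
    return 1.0 / (1.0 + np.exp(-T * (w_prev + delta)))

def step(x):
    """One time step: F uses w_ab(t-1) to produce its output, then S
    rewrites the fast weights via the Hebb-like rule of eq. 2.5."""
    global W_F
    y = W_F @ x                       # F's output (identity output units)
    s_from = W_S_from @ x             # key pattern over FROM units
    s_to   = W_S_to @ x               # entry pattern over TO units
    W_F = sigma(W_F, np.outer(s_to, s_from))  # w_ab(t) = sigma[w_ab(t-1), s_a s_b]
    return y

print(step(np.array([1.0, 0.0, 0.0])))
```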
2.1 On-Line versus Off-Line Learning. The off-line version of the algorithm would wait for the end of an episode to compute the final change of $W_S$ as the sum of all changes computed at each time step. The on-line version changes $W_S$ at every time step, assuming that $\eta$ is small enough to avoid instabilities (Williams and Zipser 1989). An interesting property of the on-line version is that we do not have to specify episode boundaries ["all episodes blend into each other" (Williams and Zipser 1989)].

2.2 Unfolding in Time. An alternative to the method above would be to employ a method similar to the "unfolding in time" algorithm for recurrent nets (e.g., Rumelhart et al. 1986). It is convenient to keep an activation stack for each unit in S. At each time step of an episode, each unit's new activation should be pushed onto its stack. S's output units should have an additional stack for storing sums of error signals received over time. With both 2.4 and 2.5, at each time step we essentially propagate the error signals obtained at S's output units down to the input units. The final weight change of $W_S$ is proportional to the sum of all contributions of all errors observed during one episode. The complete gradient for S is computed at the end of each episode by successively popping off the stacks of error signals and activations, analogously to the "unfolding in time" algorithm for recurrent networks. A disadvantage of the method is that it is not local in time.

2.3 Limitations and Extensions. When both F and S are feedforward networks, the technique proposed above is limited to only certain types of time-varying behavior. With $\sigma$ being a sum-and-squash function, the only kind of interesting time-varying output that can be produced is in response to variations in the input; in particular, autonomous dynamic behavior like oscillations (e.g., Williams and Zipser 1989) cannot be performed while the input is held fixed. It is straightforward to extend the system above to the case where both S and F are recurrent. In the experiments below, S and F are nonrecurrent, mainly to demonstrate that even a feedforward system employing the principles above can solve certain tasks that only recurrent nets were supposed to solve. The method can be accelerated by a procedure analogous to the one presented in Schmidhuber (1992).

3 Experiments

The following experiments were conducted in collaboration with Klaus Bergner, a student at Technische Universität München.
3.1 An Experiment with Unknown Time Delays. In this experiment, the system was presented with a continuous stream of input events, and F's task was to switch on the single output unit the first time an event "B" occurred following an event "A." At all other times, the output unit was to be switched off. This is the flip-flop task described in Williams and Zipser (1989). One difficulty with this task is that there can be arbitrary time lags between relevant events. An additional difficulty is that no information about "episode boundaries" is given. The on-line method was employed: The activations of the networks were never reset. Thus, activations caused by events from past "episodes" could have a harmful effect on activations and weights in later episodes.

Both F and S had the topology of standard feedforward perceptrons. F had 3 input units for 3 possible events "A," "B," and "C." Events were represented in a local manner: At a given time, a randomly chosen input unit was activated with a value of 1.0; the others were deactivated. F's output was one-dimensional. S also had 3 input units for the possible events "A," "B," and "C," as well as 3 output units, one for each fast weight of F. Neither of the networks needed hidden units for this task. The activation function of all output units was the identity function. The weight-modification function $\sigma$ (equation 2.1) for the fast weights was given by
$$w_{ab}(t) = \frac{1}{1 + \exp\{-T\,[w_{ab}(t-1) + \diamond w_{ab}(t)]\}} \tag{3.1}$$
Here $T$ determines the maximal steepness of the logistic function used to bound the fast weights between 0 and 1. The weights of S were randomly initialized between -0.1 and 0.1. The task was considered to be solved if for 100 time steps in a row F's error did not exceed 0.05. With fast-weight changes based on 2.4, $T = 10$, and $\eta = 1.0$, the system learned to solve the task within 300 time steps. With fast-weight changes based on the FROM/TO architecture and 2.5, $T = 10$, and $\eta = 0.5$, the system learned to solve the task within 800 time steps.

The typical solution to this problem has the following properties: When an A-signal occurs, S responds by producing a large weight on the B input line of F (which is otherwise small), thus enabling the F network as a B detector. When a B signal occurs, S "resets" F by causing the weight on the B line in F to become small again, thereby making F unresponsive to further B signals until the next A is received.
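For readers who wish to reproduce the setup, a possible generator for the flip-flop stream and its targets follows; it is our own reconstruction of the task description, not code from the paper.

```python
import numpy as np

def flip_flop_stream(T, seed=0):
    """Stream for the task of Section 3.1: events A/B/C, locally coded;
    the target is 1 the first time a B follows an A, else 0."""
    rng = np.random.default_rng(seed)
    events = rng.integers(0, 3, size=T)        # 0 = A, 1 = B, 2 = C
    X = np.eye(3)[events]                      # local (one-hot) coding
    armed, targets = False, []
    for e in events:
        if e == 0:                             # an "A" arms the detector
            armed = True
            targets.append(0.0)
        elif e == 1 and armed:                 # first "B" after an "A"
            targets.append(1.0)
            armed = False                      # reset until the next "A"
        else:
            targets.append(0.0)
    return X, np.array(targets)

X, d = flip_flop_stream(20)
print(d)
```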
3.2 Learning Temporary Variable Binding. Some researchers have claimed that neural nets are incapable of performing variable binding. Others, however, have argued for the potential usefulness of "dynamic links" (e.g., von der Malsburg 1981), which may be useful for variable
binding. With the fast-weight method, it is possible to train a system to use fast weights as dynamic links in order to temporarily bind variable contents to variable names (or "fillers" to "slots") as long as is necessary for solving a particular task. In the simple experiment described next, the system learns to remember where in a parking lot a car has been left. This involves binding a value to a variable that represents the car's location. Neither F nor S needed hidden units for this task. The activation function of all output units was the identity function. All inputs to the system were binary, as were F's desired outputs.

F had one input unit, which stood for the name of the variable WHERE-IS-MY-CAR? In addition, F had three output units for the names of three possible parking slots P1, P2, and P3 (the possible answers to WHERE-IS-MY-CAR?). S had three output units, one for each fast weight, and six input units. (Note that S need not always have the same input as F.) Three of the six input units were called the parking-slot detectors $I_1$, $I_2$, $I_3$. These detectors were activated for one time step when the car was parked in a given slot (while the other slot-detectors remained switched off). The three additional input units were randomly activated with binary values at each time step. These random activations served as distracting time-varying inputs from the environment of a car owner whose life looks like this: He drives his car around for zero or more time steps (at each time step the probability that he stops driving is 0.25). Then he parks his car in one of three possible slots. Then he conducts business outside the car for zero or more time steps during which all parking-slot detectors are switched off again (at each time step the probability that he finishes business is 0.25). Then he remembers where he has parked his car, goes to the corresponding slot, enters his car, and starts driving again, etc.

Our system focused on the problem of remembering the position of the car. It was trained by activating the WHERE-IS-MY-CAR? unit at randomly chosen time steps and by providing the desired output for F, which was the activation of the unit corresponding to the current slot Pi, as long as the car was parked in one of the three slots. The weights of S were randomly initialized between -0.1 and 0.1. The task was considered to be solved if for 100 time steps in a row F's error did not exceed 0.05. The on-line version (without episode boundaries) was employed. With the weight-modification function 3.1, fast-weight changes based on 2.4, $T = 10$, and $\eta = 0.02$, the system learned to solve the task within 6000 time steps. As expected, S learned to "bind" parking slot units to the WHERE-IS-MY-CAR? unit by means of strong temporary fast-weight connections. Due to the local output representation, the binding patterns were easy to understand: At a given time there was a large fast weight on the connection leading from the WHERE-IS-MY-CAR? unit to the appropriate parking slot unit (given the car was currently parked). The other fast weights remained temporarily suppressed.
4 Concluding Remarks
The system described above is a special case of a more general class of adaptive systems (which also includes conventional recurrent nets) that employ some parameterized memory function for changing a vector-valued memory structure, and some parameterized retrieval function for processing the contents of the memory structure and the current input. The only requirement is that the memory and retrieval functions be differentiable with respect to their internal parameters. Such systems work because of the existence of the chain rule. Results as above [as well as other novel applications of the chain rule (Schmidhuber 1990a, 1991)] indicate that there may be additional interesting (yet undiscovered) ways of applying the chain rule for temporal credit assignment in adaptive systems.

Acknowledgments
I wish to thank Klaus Bergner for conducting the experiments. Furthermore I wish to thank Mike Mozer, Bernhard Schatz, and Jost Bernasch for providing comments on a draft of this paper.

References

Möller, K., and Thrun, S. 1990. Task modularization by network modulation. In Proceedings of Neuro-Nimes '90, J. Rault, ed., pp. 419-432.
Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1, 263-269.
Robinson, A. J., and Fallside, F. 1987. The utility driven dynamic error propagation network. Tech. Rep. CUED/F-INFENG/TR.1, Cambridge University Engineering Department.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds., Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.
Schmidhuber, J. H. 1990a. Dynamische neuronale Netze und das fundamentale raumzeitliche Lernproblem. Dissertation, Institut für Informatik, Technische Universität München.
Schmidhuber, J. H. 1990b. Learning algorithms for networks with internal and external feedback. In Proceedings of the 1990 Connectionist Models Summer School, D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, eds., pp. 52-61. Morgan Kaufmann, San Mateo, CA.
Schmidhuber, J. H. 1991. Learning to generate sub-goals for action sequences. In Proceedings of the International Conference on Artificial Neural Networks ICANN-91, T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, eds., pp. 967-972. Elsevier Science Publishers B.V., Amsterdam.
Schmidhuber, J. H. 1992. A fixed size storage O(n³) time complexity learning algorithm for fully recurrent continually running networks. Neural Comp., in press.
von der Malsburg, C. 1981. Internal Report 81-2, Abteilung für Neurobiologie, Max-Planck-Institut für Biophysik und Chemie, Göttingen.
Werbos, P. J. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University.
Williams, R. J., and Zipser, D. 1989. Experimental analysis of the real-time recurrent learning algorithm. Connect. Sci. 1(1), 87-111.
Received 4 April 1991; accepted 18 July 1991.
Communicated by Yann LeCun
REVIEW
First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method

Roberto Battiti
Dipartimento di Matematica, Università di Trento, 38050 Povo (Trento), Italy

On-line first-order backpropagation is sufficiently fast and effective for many large-scale classification problems, but for very high precision mappings, batch processing may be the method of choice. This paper reviews first- and second-order optimization methods for learning in feedforward neural networks. The viewpoint is that of optimization: many methods can be cast in the language of optimization techniques, allowing the transfer to neural nets of detailed results about computational complexity and safety procedures to ensure convergence and to avoid numerical problems. The review is not intended to deliver detailed prescriptions for the most appropriate methods in specific applications, but to illustrate the main characteristics of the different methods and their mutual relations.

1 Introduction
There are cases in which learning speed is a limiting factor in the practical application of multilayer perceptrons to problems that require high accuracy in the network mapping function. In this class are applications related to system identification and nonlinear modeling, time-series prediction, navigation, manipulation, and robotics. In addition, the standard batch backpropagation (BP) method (e.g., Rumelhart and McClelland 1986) requires a selection of appropriate parameters by the user that is mainly executed with a trial-and-error process. Since one of the competitive advantages of neural networks is the ease with which they may be applied to novel or poorly understood problems, it is essential to consider automated and robust learning methods with a good average performance on many classes of problems. This review describes some methods that have been shown to accelerate the convergence of the learning phase on a variety of problems, and suggests a possible "taxonomy" of the different techniques based on their order (i.e., their use of first or second derivatives), space and computational requirements, and convergence properties. Some of these methods, while requiring only limited modifications of the standard BP algorithm,
yield a speed-up of very large factors¹ and, furthermore, are easier to apply because they do not require the choice of critical parameters (like the learning rate) by the neural network practitioner. The presentation attempts a classification of methods derived from the literature based on their underlying theoretical frameworks, with particular emphasis on techniques that are appropriate for the supervised learning of multilayer perceptrons (MLPs). The general strategy for the supervised learning of an input-output mapping is based on combining a quickly convergent local method with a globally convergent one. Local methods are based on appropriate local models of the function to be minimized. In the following sections, first we consider the properties of methods based on a linear model (steepest descent and variations), then we consider methods based on a quadratic model (Newton's method and approximations).

2 Backpropagation and Steepest Descent
The problem of learning an input-output mapping from a set of $P$ examples can be transformed into the minimization of a suitably defined error function. Although different definitions of the error have been used, for concreteness we consider the "traditional" sum-of-squared-differences error function defined as
$$E = \frac{1}{2} \sum_{p=1}^{P} \sum_{r=1}^{n_o} (t_{pr} - o_{pr})^2 \tag{2.1}$$
where $t_{pr}$ and $o_{pr}$ are the target and the current output values for pattern $p$, respectively, and $n_o$ is the number of output units.

The learning procedure known as backpropagation (Rumelhart and McClelland 1986) is composed of two stages. In the first, the contributions to the gradient coming from the different patterns ($\partial E_p / \partial w_{ij}$) are calculated by "backpropagating" the error signal. The partial contributions are then used to correct the weights, either after every pattern presentation (on-line BP), or after they are summed in order to obtain the total gradient (batch BP). Let us define $g_k$ as the gradient of the error function [$g_k = \nabla E(w_k)$]. The batch backpropagation update is a form of gradient descent defined as
$$w_{k+1} = w_k - \epsilon\, g_k \tag{2.2}$$
while the on-line update is
$$w_{k+p} = w_{k+p-1} - \epsilon\, \nabla E_p(w_{k+p-1}), \qquad p = 1, \dots, P \tag{2.3}$$
where pattern $p$ is presented at the $p$th step of the epoch starting at $w_k$.

¹It is not unusual to observe speed-ups of 100-1000 with respect to BP with fixed learning and momentum rates.
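A minimal sketch of the two update schemes, applied to a toy linear least-squares problem of our own choosing, may clarify the difference between equations 2.2 and 2.3.

```python
import numpy as np

def train(per_pattern_grad, w0, patterns, epochs, eps, online=True):
    """Generic sketch of eqs. 2.2-2.3. per_pattern_grad(w, p) returns
    grad E_p(w); batch sums all pattern gradients before each step."""
    w = w0.copy()
    for _ in range(epochs):
        if online:
            for p in patterns:                    # eq. 2.3: update per pattern
                w -= eps * per_pattern_grad(w, p)
        else:
            g = sum(per_pattern_grad(w, p) for p in patterns)
            w -= eps * g                          # eq. 2.2: one step per epoch
    return w

# Toy example: E_p = 0.5 * (w . x_p - t_p)^2, exact solution w = [1, 2]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([1.0, 2.0, 3.0])
grad = lambda w, p: (w @ X[p] - t[p]) * X[p]
print(train(grad, np.zeros(2), range(3), epochs=200, eps=0.1, online=True))
print(train(grad, np.zeros(2), range(3), epochs=200, eps=0.1, online=False))
# both variants approach w = [1, 2] for a sufficiently small eps
```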
If the learning rate $\epsilon$ tends to zero, the difference between the weight vectors $w_{k+p}$ during one epoch of the on-line method tends to be small, and the step $\epsilon \nabla E_p(w_{k+p-1})$ induced by a particular pattern $p$ can be approximated by $\epsilon \nabla E_p(w_k)$ (by calculating the gradient at the initial weight vector). Summing the contributions for all patterns, the movement in weight space during one epoch will be similar to the one obtained with a single batch update. However, in general the learning rate has to be large to accelerate convergence, so that the paths in weight space of the two methods differ.

The on-line procedure has to be used if the patterns are not available before learning starts [see, for example, the perceptron used for adaptive equalization in Widrow and Stearns (1985)], and a continuous adaptation to a stream of input-output signals is desired. On the contrary, if all patterns are available, collecting the total gradient information before deciding the next step can be useful in order to avoid a mutual interference of the weight changes (caused by the different patterns) that could occur for large learning rates (this effect is equivalent to a sort of noise in the true gradient direction).

One of the reasons in favor of the on-line approach is that it possesses some randomness that may help in escaping from a local minimum. The objection to this is that the method may, for the same reason, miss a good local minimum, while there is the alternative method of converging to the "nearest" local minimum and using randomness² to escape only after convergence. In addition, the on-line update may be useful when the number of patterns is so large that the errors involved in the finite precision computation of the total gradient may be comparable with the gradient itself. This effect is particularly present in analog implementations of backpropagation, but it can be controlled in digital implementations by increasing the number of bits during the gradient accumulation.

The fact that many patterns possess redundant information [see, for example, the case of hand-written character recognition in LeCun (1986)] has been cited as an argument in favor of on-line BP, because many of the contributions to the gradient are similar, so that waiting for all contributions before updating can be wasteful. In other words, averaging over more examples to obtain a better estimate of the gradient does not improve the learning speed sufficiently to compensate for the additional computational cost of taking these patterns into account. Nonetheless, the redundancy can also be limited using batch BP, provided that learning is started with a subset of "relevant" patterns and continued after convergence by progressively increasing the example set. This method has, for example, been used in Kramer and Sangiovanni-Vincentelli (1988) for the digit recognition problem.³

²In addition, there are good reasons why random noise may not be the best available alternative to escape from a local minimum and avoid returning to it. See, for example, the recently introduced TABU methods (Glover 1987).
³If the redundancy is clear (when, for example, many copies of the same example are present), one may preprocess the example set in order to eliminate the duplication. On the contrary, if redundancy is only partial, the redundant patterns have to be presented to and learned by the network in both versions.
Even if the training set is redundant, on-line BP may be slow relative to second-order methods for badly conditioned problems. The convergence of methods based on gradient descent (or approximations thereof) depends critically on the relative size of the maximum and minimum eigenvalues of the Hessian matrix [see LeCun et al. (1991) and equation 2.6 for the case of steepest descent]. This is related to the "narrow valley" effect described in Rumelhart and McClelland (1986). In addition, the batch approach lends itself to straightforward modifications using second-order information, as will be shown in the following sections.

At this point, in order not to mislead the reader in the choice of the most appropriate method for a specific application, it is useful to remember that many large-scale experiments (mainly in pattern recognition-classification) have used the simple on-line version of BP with full satisfaction, considering both the final result and the number of iterations. In some cases, with a careful tuning of the on-line procedure, the solution is reached in a very small number of epochs, that is, in a few presentations of the complete example set [see, for example, Rumelhart and McClelland (1986)], and it is difficult to reach a comparable learning speed with batch techniques (Cardon et al. 1991). Tuning operations are, for example, the choice of appropriate parameters like the learning and momentum rates (Fahlman 1988), "annealing" schedules for the learning rate (that is progressively reduced) (Malferrari et al. 1990), updating schemes based on summing the contributions of related patterns (Sejnowski and Rosenberg 1986), "small batches," "selective" corrections only if the error is larger than a threshold (that may be progressively reduced) (Vincent 1991; Allred and Kelly 1990), randomization of the sequence of pattern presentation, etc. The given references are only some examples of significant applications, out of an abundant literature.⁴ Because in these cases only approximated output values are required and the example set often is characterized by a large degree of redundancy, these two attributes could be considered as votes in favor of on-line BP, again provided that the trial-and-error phase is not too expensive.

⁴The availability of many variations of the on-line technique is one of the reasons why "fair" comparisons with the batch and second-order versions are complex. Which version has to be chosen for the comparison? If the final convergence results have been obtained after a tuning process, should the tuning times be included in the comparison?

In the original formulation, the learning rate $\epsilon$ was taken as a fixed parameter. Unfortunately, if the learning rate is fixed in an arbitrary way, there is no guarantee that the net will converge to a point with vanishing gradient. Nonetheless, convergence in the on-line approach can be obtained by appropriately choosing a fixed and sufficiently small learning rate. The issue of an appropriate fixed learning rate for on-line LMS learning has been investigated in the adaptive signal processing literature
[see, for example, Bingham (1988)]. The result is that the convergence of stochastic LMS is guaranteed if $\epsilon < 1/(N \lambda_{\max})$, where N is the number of parameters being optimized and $\lambda_{\max}$ is the largest eigenvalue of the autocorrelation function of the input.⁵ A detailed study of adaptive filters is presented in Widrow and Stearns (1985). The effects of the autocorrelation matrix of the inputs on the learning process (for a single linear unit) are discussed in LeCun et al. (1991). In this framework the appropriate learning rate for gradient descent is $1/\lambda_{\max}$. These results cannot be extended to multilayer networks (with nonlinear transfer functions) in a straightforward manner, but they can be used as a starting point for useful heuristics. The convergence properties of the LMS algorithm with adaptive learning rate are presented in Luo (1991), together with a clear comparison of the LMS algorithm with stochastic gradient descent and adaptive filtering algorithms. The main result is that if the learning rate $\epsilon_n$ for the nth training cycle satisfies the two conditions
$$\sum_{n=1}^{\infty} \epsilon_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \epsilon_n^2 < \infty \tag{2.4}$$
then the sequence of weight matrices generated by the LMS algorithm (with a cyclic pattern presentation) will converge to the optimal solution (minimizing the mean-square error). But even if $\epsilon$ is appropriately chosen so that the error decreases with a reasonable speed and oscillations are avoided, gradient descent is not always the fastest method to employ. This is not an intuitive result, because the negative gradient is the direction of fastest decrease in the error. Unfortunately, the "greed" in trying to reach the minimum along this one-dimensional direction is paid at the price that subsequent gradient directions tend to interfere, so that in a weight space with dimension N, the one-dimensional minimization process has to be repeated a number of times that is normally much higher than N (even for simple quadratic functions).⁶ In the steepest descent method, the process of minimizing along successive negative gradients is described as
$$w_{k+1} = w_k - \epsilon_k g_k \tag{2.5}$$
where $\epsilon_k$ minimizes $E(w_k - \epsilon g_k)$. If steepest descent is used to minimize a quadratic function $Q(w) = c^T w + \frac{1}{2} w^T G w$ ($G$ symmetric and positive definite),
⁵I owe this remark to the referee.
⁶It is easy to show that, if exact one-dimensional optimization along the negative gradient is used, the gradient at the next step is perpendicular to the previous one. If, considering an example in two dimensions, the lines at equal error are given by elongated ellipses, the system, for a "general" starting point, goes from a point to the one that is tangent to the equal-error lines along the gradient, and then repeats along a perpendicular direction.
it can be shown that
$$|Q(w_{k+1}) - Q(w^*)| \approx \left( \frac{\eta_{\max} - \eta_{\min}}{\eta_{\max} + \eta_{\min}} \right)^2 |Q(w_k) - Q(w^*)| \tag{2.6}$$
where $\eta_{\max}$ and $\eta_{\min}$ are the maximum and minimum eigenvalues of the matrix G, and $w^*$ is the minimizer [see Luenberger (1973)]. If these two eigenvalues are very different, the distance from the minimum value is multiplied each time by a number that is close to one. The type of convergence in equation 2.6 is termed q-linear convergence. One has q-superlinear convergence if, for some sequence $c_k$ that converges to 0, one has
$$|Q(w_{k+1}) - Q(w^*)| \le c_k\, |Q(w_k) - Q(w^*)| \tag{2.7}$$
Finally, $w_k$ is said to converge with q-order at least p if, for some constant c, one has
$$|Q(w_{k+1}) - Q(w^*)| \le c\, |Q(w_k) - Q(w^*)|^p \tag{2.8}$$
In practice q-linear convergence tends to be very slow, while q-superlinear or q-quadratic (p = 2) convergence is eventually much faster.⁷ In the following sections, we illustrate some techniques that can be used to ensure convergence, to avoid numerical problems related to finite-precision computation, and to accelerate the minimization process with respect to standard batch BP.
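The effect of equation 2.6 can be checked numerically; the sketch below (an illustration, not from the paper) runs steepest descent with exact line searches on a two-dimensional quadratic with eigenvalues 1 and 50, for which the predicted contraction factor is ((50 - 1)/(50 + 1))^2, about 0.92.

```python
import numpy as np

G = np.diag([1.0, 50.0])           # badly conditioned quadratic Q = 0.5 w'Gw
w = np.array([50.0, 1.0])          # a "general" starting point
for k in range(5):
    g = G @ w                      # gradient of Q
    eps = (g @ g) / (g @ G @ g)    # exact minimizer along -g
    w = w - eps * g
    print(k, 0.5 * w @ G @ w)      # decreases by a factor close to one
```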
3 Conjugate Gradient Methods

Let us consider a quadratic function Q(w) of the type described in the previous section. We have seen that one of the difficulties in using the steepest descent method is that a one-dimensional minimization in direction a followed by a minimization in direction b does not imply that the function is minimized on the subspace generated by a and b. Minimization along direction b may in general spoil a previous minimization along direction a (this is why the one-dimensional minimization in general has to be repeated a number of times much larger than the number of variables). On the contrary, if the directions were noninterfering and linearly independent, at the end of N steps the process would converge to the minimum of the quadratic function. The concept of noninterfering directions is at the basis of the conjugate gradient method (CG) for minimization. Two directions are mutually conjugate with respect to the matrix G if
$$p_i^T G p_j = 0 \quad \text{when} \quad i \ne j \tag{3.1}$$
⁷For example, for a q-quadratically convergent method, the number of significant digits in the solution is doubled after each iteration.
After minimizing in direction $p_i$, the gradient at the minimizer will be perpendicular to $p_i$. If a second minimization is in direction $p_{i+1}$, the change of the gradient along this direction is $g_{i+1} - g_i = \alpha G p_{i+1}$ (for some constant $\alpha$). The matrix G is indeed the Hessian, the matrix containing the second derivatives, and in the quadratic case the model coincides with the original function. Now, if equation 3.1 is valid, this change is perpendicular to the previous direction [$p_i^T (g_{i+1} - g_i) = 0$]; therefore the gradient at the new point remains perpendicular to $p_i$ and the previous minimization is not spoiled. While for a quadratic function the conjugate gradient method is guaranteed to converge to the minimizer in at most (N + 1) function and gradient evaluations (apart from problems caused by the finite precision), for a general function it is necessary to iterate the method until a suitable approximation to the minimizer is obtained. Let us introduce the vector $y_k = g_{k+1} - g_k$. The first search direction $p_1$ is given by the negative gradient $-g_1$. Then the sequence $w_k$ of approximations to the minimizer is defined in the following way:
$$w_{k+1} = w_k + \alpha_k p_k \tag{3.2}$$
$$p_{k+1} = -g_{k+1} + \beta_k p_k \tag{3.3}$$
where $g_k$ is the gradient, $\alpha_k$ is chosen to minimize E along the search direction $p_k$, and $\beta_k$ is defined by
$$\beta_k = \frac{y_k^T g_{k+1}}{y_k^T p_k} \tag{3.4}$$
There are different versions of the above equation. In particular, the Polak-Ribière choice is $\beta_k = y_k^T g_{k+1} / g_k^T g_k$; the Fletcher-Reeves choice is $\beta_k = g_{k+1}^T g_{k+1} / g_k^T g_k$. They all coincide for a quadratic function (Shanno 1978). A major difficulty with all the above forms is that, for a general function, the obtained directions are not necessarily descent directions and numerical instability can result. Although for a wide class of functions the traditional CG method with exact searches and exact arithmetic is superlinearly convergent, implementations of the conjugate gradient method with finite precision computation are "nearly always linearly convergent" (Gill et al. 1981), but the number of steps is in practice much smaller than that required by steepest descent. The use of a momentum term to avoid oscillations in Rumelhart and McClelland (1986) can be considered as an approximated form of conjugate gradient. In both cases, the gradient direction is modified with a term that takes the previous direction into account, the important difference being that the parameter $\beta$ in conjugate gradient is automatically defined by the algorithm, while the momentum rate has to be "guessed" by the user. Another difficulty related to the use of a momentum term is due to the fact that there is an upper bound on the adjustment caused by the momentum. For example, if all partial derivatives are equal to 1, then the
exponentially weighted sum caused by the momentum rate $\alpha$ converges to $1/(1 - \alpha)$ [see Jacobs (1988) for details].⁸ Furthermore, summing the momentum term to the one proportional to the negative gradient may produce an ascent direction, so that the error increases after the weight update. Among the researchers using conjugate gradient methods for the MLP are Barnard and Cole (1988), Johansson et al. (1990), Bengio and Moore (1989), Drago and Ridella (1991), Hinton's group in Toronto, the groups at CMU, Bell Labs, etc. A version in which the one-dimensional minimization is substituted by a scaling of the step that depends on the success in error reduction and the goodness of a one-dimensional quadratic approximation is presented in Møller (1990) (SCG). This O(N) scheme incorporates ideas from the "model-trust region" methods (see Section 4.3) and "safety" procedures that are absent in the CG schemes, yielding convergence results that are comparable with the OSS method described in Section 6. Some modifications of the method are presented in Williams (1991). It is worth stressing that expensive one-dimensional searches are also discouraged by current results in optimization (see Section 4.1): the search can be executed using only a couple of function and gradient evaluations.
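Equations 3.2-3.4 translate into a few lines; the sketch below (an illustration, not the paper's code) uses the Polak-Ribière choice of β and assumes a user-supplied line_search routine returning an acceptable α_k (see Section 4.1).

```python
import numpy as np

def cg_minimize(grad, w, n_steps, line_search):
    # Nonlinear conjugate gradient (equations 3.2-3.3) with the
    # Polak-Ribiere choice of beta; restarts of p to -g are often
    # added every N steps in practice.
    g = grad(w)
    p = -g                              # first direction: negative gradient
    for _ in range(n_steps):
        alpha = line_search(w, p)       # one-dimensional search along p
        w = w + alpha * p               # equation 3.2
        g_new = grad(w)
        y = g_new - g                   # y_k = g_{k+1} - g_k
        beta = (y @ g_new) / (g @ g)    # Polak-Ribiere version of 3.4
        p = -g_new + beta * p           # equation 3.3
        g = g_new
    return w
```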
4 Newton's Method

Newton's method can be considered as the basic local method using second-order information. It is important to stress that its practical applicability to multilayer perceptrons is hampered by the fact that it requires a calculation of the Hessian matrix (a complex and expensive task⁹). Nonetheless, the method is briefly illustrated because most of the "useful" second-order methods originate from it as approximations or variations. It is based on modeling the function with the first three terms of the Taylor-series expansion about the current point $w_c$:
$$m_c(w_c + s) = E(w_c) + \nabla E(w_c)^T s + \frac{1}{2}\, s^T \nabla^2 E(w_c)\, s \tag{4.1}$$
and solving for the step $s^N$ that brings to a point where the gradient of the model is zero: $\nabla m_c(w_c + s^N) = 0$. This corresponds to solving the

⁸In addition, B. Pearlmutter has recently shown that momentum, even if chosen "optimally," can do no better than q-linear convergence (see his poster at the NIPS 1991 conference).
⁹A "brute force" method to calculate H is that of using a finite-difference formula. If the gradient is available (as is the case for feedforward nets), one may use $H_{\cdot j} = [\nabla E(w_c + h_j e_j) - \nabla E(w_c)]/h_j$, with suitable steps $h_j$ [see Dennis et al. (1981)]. Note that N + 1 gradient computations are needed, so that the method is not suggested for large networks!
following linear system:
$$\nabla^2 E(w_c)\, s^N = -\nabla E(w_c) \tag{4.2}$$
$s^N$ is, by definition, Newton's step (and direction). If the Hessian matrix ($\nabla^2 E$ or H, for short) is positive definite and the quadratic model is correct, one iteration is sufficient to reach the minimum. Because one iteration consists in solving the linear system in equation 4.2, the complexity of one step is $O(N^3)$, using standard methods.¹⁰ In general, if the initial point $w_0$ is sufficiently close to the minimizer $w_*$, and $\nabla^2 E(w_*)$ is positive definite, the sequence generated by repeating Newton's algorithm converges q-quadratically to $w_*$ [see Dennis and Schnabel (1983) for details]. Assuming that the Hessian matrix can be obtained in reasonable computing times, the main practical difficulties in applying the "pure" Newton's method of equation 4.2 arise when the Hessian is not positive definite, or when it is singular or ill-conditioned. If the Hessian is not positive definite (this may be the case in multilayer perceptron learning!), there is no "natural" scaling in the problem: there are directions $p_k$ of negative curvature (i.e., such that $p_k^T H p_k \le 0$) that would suggest "infinite" steps in order to minimize the model. Unfortunately, long steps increase the probability of leaving the region where the model is appropriate, producing nonsense. This behavior is not uncommon for multilayer perceptron learning: in some cases a local minimization step increases some weights by large amounts, pushing the output of the sigmoidal transfer function into the saturated region. When this happens, some second derivatives are very small and, given the finite machine precision or the approximations, the calculated Hessian will not be positive definite. Even if it is, the linear system of equation 4.2 may be seriously ill-conditioned. Modified Newton's methods incorporate techniques for dealing with the above problems, changing the model Hessian in order to obtain a sufficiently positive definite and nonsingular matrix. It is worth observing that, although troublesome for the above reasons, the existence of directions of negative curvature may be used to continue from a saddle point where the gradient is close to zero.¹¹ While calculating the analytic gradient for multilayer perceptrons can be efficiently executed by "backpropagating" the error, calculating the Hessian is computationally complex, so that practical methods have to rely on suitable approximations. In the following sections we illustrate some modifications of Newton's method to deal with global convergence, indefinite Hessians, and iterative approximations to the Hessian itself. In the review by White (1989) the use of appropriate modifications of Newton's methods for learning is considered starting from a statistical perspective.

¹⁰A smaller upper bound for matrix inversion is actually $O(n^{\log_2 7})$; see Press et al. (1988) for details.
¹¹Only second-order methods provide this possibility, while methods based on steepest descent are condemned to failure in this case: how many "local minima" are in reality saddle points!
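In code, one iteration of the "pure" method amounts to forming the gradient and Hessian and solving the linear system of equation 4.2; the callables grad_E and hess_E below are assumptions of this sketch.

```python
import numpy as np

def newton_step(grad_E, hess_E, w):
    # One "pure" Newton iteration (equation 4.2): solve H s = -g.
    # Solving the system costs O(N^3) with standard methods, and the
    # step is meaningful only where H is positive definite.
    g = grad_E(w)
    H = hess_E(w)
    s = np.linalg.solve(H, -g)
    return w + s
```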
4.1 Globally Convergent Modifications: Line Searches. Considering second-order methods, their fast local convergence property suggests that they be used when the approximation is close to the minimizer. On the other hand, getting from an initial configuration (that may be very different from the minimizer) to a point where the local model is accurate requires some additional effort. The key idea to obtain a general-purpose successful learning algorithm is that of combining a fast tactical local method with a robust strategic method that assures global convergence. Because Newton's method (or its modifications when the analytic Hessian is not available) has to be used near the solution, the suggested method is that of trying the Newton step first, deciding whether the obtained point is acceptable or not, and backtracking in the latter case (i.e., selecting a shorter step in the Newton direction). One of the reasons for searching the next point along the Newton direction is that this is a descent direction, that is, the value of the error is guaranteed to decrease for sufficiently small steps along that direction. It is easy to see why: because the Hessian (and therefore its inverse) is symmetric and positive definite, the directional derivative of the error is negative:
$$\frac{dE}{d\lambda}(w_c + \lambda s^N) = \nabla E(w_c)^T s^N = -\nabla E(w_c)^T H_c^{-1} \nabla E(w_c) < 0$$
If the analytic Hessian has to be approximated, it is essential to consider only approximations that maintain symmetry and positive definiteness, so that the one-dimensional minimization step remains unaltered. When using a line-search algorithm outside the convergence region of Newton's method, some simple prescriptions have to be satisfied in order to obtain global convergence. Let the accepted step along the search direction $p_c$ be $\lambda_c$; the requirement that $E(w_c + \lambda_c p_c) < E(w_c)$ is not sufficient: the sequence $w_k$ may not converge, or it may converge to a point where the gradient is different from zero. During each one-dimensional search the steps must decrease the error by a sufficient amount with respect to the step length (I) and they must be long enough to ensure sufficient progress (II). Furthermore, the search direction must be kept sufficiently away from being orthogonal to the gradient. A formulation of the above requirements that is frequently used in optimization is based on the work of Armijo and Goldstein [see, for example, Goldstein (1967)]. Requirements (I) and (II) become
$$E(w_c + \lambda p_c) \le E(w_c) + \alpha \lambda \nabla E(w_c)^T p_c \tag{4.3}$$
where $\alpha$ is a fixed constant $\in (0, 1)$ and $\lambda > 0$, and
$$\nabla E(w_c + \lambda p_c)^T p_c \ge \beta \nabla E(w_c)^T p_c \tag{4.4}$$
where $\beta$ is a fixed constant $\in (\alpha, 1)$. The condition $\beta > \alpha$ assures that the two requirements can be simultaneously satisfied.
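Requirement (I) already suggests the familiar backtracking loop sketched below (illustrative only; a complete implementation would also enforce requirement (II) of equation 4.4 and the nonorthogonality safeguard).

```python
def backtracking(E, grad_E, w, p, alpha=1e-4, factor=0.5, lam=1.0):
    # Shrink the tentative step until the sufficient-decrease
    # condition of equation 4.3 holds along the descent direction p.
    g_dot_p = grad_E(w) @ p
    E_w = E(w)
    while E(w + lam * p) > E_w + alpha * lam * g_dot_p:
        lam *= factor
    return lam
```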
If the two above conditions are satisfied at each iteration and if the error is bounded below, the sequence $w_k$ obtained is such that $\lim_{k\to\infty} \nabla E(w_k) = 0$, provided that each step is kept away from orthogonality to the gradient ($\lim_{k\to\infty} \nabla E(w_k)^T s_k / \|s_k\| \ne 0$). This result is quite important: we are permitted to use fast approximated one-dimensional searches without losing global convergence. Recent computational tests show that methods based on fast one-dimensional searches in general require much less computational effort than methods based on sophisticated one-dimensional minimizations.¹² The line-search method suggested in Dennis and Schnabel (1983) is well suited for multilayer perceptrons (where the gradient can be obtained with limited effort during the computation of the error) and requires only a couple of error and gradient evaluations per iteration, on average. The method is based on quadratic and cubic interpolations and is designed to use in an efficient way the available information about the function to be minimized [see Dennis and Schnabel (1983) for details]. A similar method based on quadratic interpolation is presented in Battiti (1989).

4.2 Indefinite Hessians: Modified Cholesky Factorization. When the Hessian matrix in the local model introduced in equation 4.1 is not positive definite and well conditioned, equation 4.2 cannot be used without modifications. This can be explained by introducing the spectral decomposition of the Hessian (based on the availability of eigenvalues and eigenvectors), writing the matrix H as a sum of projection operators:
$$H = U \Lambda U^T = \sum_{i=1}^{N} \eta_i\, u_i u_i^T \tag{4.5}$$
where $\Lambda$ is diagonal ($\Lambda_{ii}$ is the eigenvalue $\eta_i$) and U orthonormal. It is easy to see that, if some eigenvalues are close to zero (with respect to the largest eigenvalue), the inverse matrix has eigenvalues close to infinity, a sure source of numerical problems.¹³ If one eigenvalue is negative, the quadratic model does not have a minimum, because large movements in the direction of the corresponding eigenvector decrease the error value to arbitrarily negative values. A recommended strategy for changing the model Hessian in order to avoid the above problems is that of summing to it a simple diagonal matrix of the form $\mu_c I$ (I being the identity matrix), so that $[\nabla^2 E(w_c) + \mu_c I]$

¹²In simple words: it does not pay to use a method that requires a limited number of iterations if each iteration requires a huge amount of computation.
¹³In detail, the conditioning number $\kappa(A)$ of a matrix A is defined as $\|A\|\,\|A^{-1}\|$, where $\|\cdot\|$ is the matrix operator norm induced by the vector norm. The conditioning number is the ratio of the maximum to the minimum stretch induced by A and measures, among other effects, the sensitivity of the solution of a linear system to finite-precision arithmetic, which is of the order of $\kappa(A)$ (machine precision). $\kappa(A)$ does not depend on scaling of the matrix by a fixed constant.
is positive definite and safely well conditioned. A proper value for $\mu_c$ can be efficiently found using the modified Cholesky factorization described in Gill et al. (1981) and the heuristics described in Dennis and Schnabel (1983). The resulting algorithm is as follows. The Cholesky factors of a positive-definite symmetric matrix can be considered as a sort of "square root" of the matrix. The original matrix M is expressed as the product $LDL^T$, where L is a unit lower triangular matrix¹⁴ and D is a diagonal matrix with strictly positive diagonal elements. Taking the square root of the diagonal elements and forming with them the matrix $D^{1/2}$, the original matrix can be written as $M = LD^{1/2}D^{1/2}L^T = \bar{L}\bar{L}^T = R^T R$, where $\bar{L}$ is a general lower triangular matrix, and R a general upper triangular matrix. The Cholesky factorization can be computed in about $\frac{1}{6}N^3$ multiplications and additions and is characterized by good numerical stability. If the original matrix is not positive definite, the factorization can be modified in order to obtain factors L and D with all the diagonal elements in D positive and all the elements in L uniformly bounded. The obtained factorization corresponds to the factors of a matrix $\bar{H}_c$ differing from the original one only by a diagonal matrix K with nonnegative elements:
$$\bar{H}_c = LDL^T = H_c + K \tag{4.6}$$
where K is null if $H_c$ is already sufficiently positive definite. A suitable choice of the bound $\beta$ used in the modified factorization is described in Gill et al. (1981).¹⁵ The availability of the modified Cholesky factorization is the starting point for the algorithm described in Dennis and Schnabel (1983) to find $\mu_c \ge 0$ such that $\nabla^2 E(w_c) + \mu_c I$ is positive definite and well conditioned. Considering the eigenspace of $\nabla^2 E(w_c)$, $\mu_c$ has to be slightly larger than the magnitude of the most negative eigenvalue. If the matrix K is zero, $\mu_c$ is set to zero; otherwise it is set to the minimum of two upper bounds. One upper bound is
$$u_1 = \max_{1 \le i \le N} \{ k_{ii} \}$$
In fact, the magnitude of the most negative eigenvalue of $\nabla^2 E(w_c)$ must be less than the maximum $k_{ii}$, because after summing $k_{ii}$ to it the eigenvalue becomes positive (remember that the modified factorization produces a positive definite matrix). The other upper bound is derived from the

¹⁴A unit lower triangular matrix has the diagonal elements equal to one and all the elements above the diagonal equal to zero.
¹⁵The prescription is $\beta^2 = \max\{\gamma,\ \xi/\sqrt{N^2 - 1},\ \epsilon_M\}$, where $\gamma$ is the largest in magnitude of the diagonal elements of $H_c$, $\xi$ the largest in magnitude of the off-diagonal elements, and $\epsilon_M$ the machine precision. This result is obtained by requiring that the diagonal modification be minimal (minimal norm of K), and that sufficiently positive-definite matrices $H_c$ be left unaltered (K null in this case).
Gerschgorin circle theorem and is defined as
$$u_2 = \max_{1 \le i \le N} \left\{ \sum_{j \ne i} |h_{ij}| - h_{ii} \right\} + D \tag{4.7}$$
where D is a positive factor needed to take the finite precision of computation into account.¹⁶ If $u_2$ is added to the diagonal elements, the matrix becomes strictly diagonally dominant ($h_{ii} - \sum_{j \ne i} |h_{ij}| > 0$) and therefore positive definite. The algorithm uses $\mu_c = \min\{u_1, u_2\}$. Software routines for the Cholesky decomposition can be found in Dennis and Schnabel (1983) or in Dongarra et al. (1979).

4.3 Relations with Model-Trust Region Methods. Up to now we have considered optimization techniques based on finding a search direction and moving by an acceptable amount in that direction ("step-length-based methods"). In Newton's method the direction was obtained by "multiplying" the gradient by the inverse Hessian, and the step was the full Newton step when possible (to obtain fast local convergence) or a shorter step when the global strategy required it. In Section 4.2 we described methods to modify the Hessian when this was not positive definite. Because the modification consisted in adding to the local model a term quadratic in the step magnitude ($m_{\text{modified}}(w_c + s) = m_c(w_c + s) + \mu_c\, s^T s$), one may suspect that minimizing the new model is equivalent to minimizing the original one with the constraint that the step s is not too large. Now, while in line-search algorithms the direction is maintained and only the step length is changed, there are alternative strategies based on choosing first a step length and then using the full quadratic model (not just the one-dimensional one of equation 4.1) to determine the appropriate direction. These competitive methods are called "model-trust-region methods," with the idea that the model is trusted only within a region that is updated using the experience accumulated during the search process. The above suspicion is true and there is indeed a close relationship between "trust-region" methods and "line-search" methods with diagonal modification of the Hessian. This relationship is described by the following theorem. Suppose that we are looking for the step $s_c$ that solves
$$\min_s\; m_c(w_c + s) = E(w_c) + \nabla E(w_c)^T s + \frac{1}{2}\, s^T H_c s \quad \text{subject to } \|s\| \le \delta_c \tag{4.8}$$
the above problem is solved by
$$s(\mu) = -(H_c + \mu I)^{-1}\, \nabla E(w_c) \tag{4.9}$$
¹⁶$D$ is $\sqrt{\epsilon_M}\,(\text{maxev} - \text{minev})$, $\epsilon_M$ being the smallest positive number $\epsilon$ such that $1 + \epsilon > 1$, and maxev, minev being estimates of the maximum and minimum eigenvalues.
for the unique $\mu \ge 0$ such that the step has the maximum allowed length ($\|s(\mu)\| = \delta_c$), unless the step with $\mu = 0$ is inside the trusted region ($\|s(0)\| \le \delta_c$), in which case $s(0)$ is the solution, equal to the Newton step. We omit the proof and the usable techniques for finding $\mu$, leaving the topics as a suggested reading for those in search of elegance and inspiration [see, for example, Dennis and Schnabel (1983)]. As a final observation, note that the diagonal modification to the Hessian is a sort of compromise between gradient descent and Newton's method: when $\mu$ tends to zero the original Hessian is (almost) positive definite and the step tends to coincide with Newton's step; when $\mu$ has to be large the diagonal addition $\mu I$ tends to dominate and the step tends to one proportional to the negative gradient:
$$s(\mu) = -(H_c + \mu I)^{-1}\, \nabla E(w_c) \approx -\frac{1}{\mu}\, \nabla E(w_c)$$
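Both faces of this compromise fit in a short sketch; the shift selection below is a crude stand-in for the modified Cholesky strategy of Section 4.2, not the Gill et al. algorithm itself.

```python
import numpy as np

def positive_definite_shift(H, mu0=1e-3):
    # Raise mu until H + mu*I admits a Cholesky factorization.
    mu = 0.0
    while True:
        try:
            np.linalg.cholesky(H + mu * np.eye(H.shape[0]))
            return mu
        except np.linalg.LinAlgError:
            mu = max(2.0 * mu, mu0)

def shifted_step(H, g):
    # Step of equation 4.9: Newton-like when mu is small,
    # a short negative-gradient step when mu must be large.
    mu = positive_definite_shift(H)
    return -np.linalg.solve(H + mu * np.eye(H.shape[0]), g)
```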
There is no need to decide from the beginning about whether to use the gradient as a search direction; the algorithm takes care of selecting the direction that is appropriate for a local configuration of the error surface. While not every usable multilayer perceptron needs to have thousands of weights, it is true that this number tends to be large for some interesting applications. Furthermore, while the analytic gradient is easily obtained, the calculation of second derivatives is complex and time-consuming. For these reasons, the methods described above, while fundamental from a theoretical standpoint, have to be simplified and approximated in suitable ways that we describe in the next two sections.

5 Secant Methods
When the Hessian is not available analytically, secant methods¹⁷ are widely used techniques for approximating the Hessian in an iterative way using only information about the gradient. In one dimension the second derivative $\partial^2 E(w)/\partial w^2$ can be approximated with the slope of the secant line (therefore the term "secant") through the values of the first derivatives in two near points:
$$\frac{\partial^2 E(w)}{\partial w^2} \approx \frac{E'(w_+) - E'(w_c)}{w_+ - w_c} \tag{5.1}$$
In more dimensions the situation is more complex. Let the current and next points be $w_c$ and $w_+$; defining $y_c = \nabla E(w_+) - \nabla E(w_c)$ and $s_c = w_+ - w_c$, the secant equation analogous to equation 5.1 is
$$H_+\, s_c = y_c \tag{5.2}$$
¹⁷Historically these methods were called quasi-Newton methods. Here we follow the terminology of Dennis and Schnabel (1983), where the term quasi-Newton refers to all algorithms "derived" from Newton's method.
The new problem is that in more than one dimension equation 5.2 does not determine a unique $H_+$ but leaves the freedom to choose from an $(N^2 - N)$-dimensional affine subspace $Q(s_c, y_c)$ of matrices obeying equation 5.2. The new suggested strategy is that of using equation 5.2 not to determine but to update a previously available approximation. In particular, Broyden's update is based on a least change principle: find the matrix in $Q(s_c, y_c)$ that is closest to the previously available matrix. This is obtained by projecting¹⁸ the matrix onto $Q(s_c, y_c)$. The resulting Broyden's update is
$$H_+ = H_c + \frac{(y_c - H_c s_c)\, s_c^T}{s_c^T s_c} \tag{5.3}$$
Unfortunately, Broyden's update does not guarantee a symmetric matrix. For this reason, its use in optimization is strongly discouraged (unless we are willing to live with directions that are not descent directions, a basic requirement of line-search methods). Projecting the matrix obtained with Broyden's update onto the subspace of symmetric matrices is not enough: the new matrix may be out of $Q(s_c, y_c)$. Fortunately, if the two above projections are repeated, the obtained sequence of matrices $(H_+)_n$ converges to a matrix that is both symmetric and in $Q(s_c, y_c)$. This is the symmetric secant update of Powell:
$$H_+ = H_c + \frac{(y_c - H_c s_c)\, s_c^T + s_c\, (y_c - H_c s_c)^T}{s_c^T s_c} - \frac{(y_c - H_c s_c)^T s_c}{(s_c^T s_c)^2}\, s_c s_c^T \tag{5.4}$$
The Powell update is one step forward, but not the solution. In the previous sections we have shown the importance of having a symmetric and positive definite approximation to the Hessian. Now, one can prove that $H_+$ is symmetric and positive definite if and only if $H_+ = J_+ J_+^T$ for some nonsingular matrix $J_+$. Using this fact, one update of this kind can be derived, expressing $H_+ = J_+ J_+^T$ and using Broyden's method to derive a suitable $J_+$.¹⁹ The resulting update is historically known as the Broyden, Fletcher, Goldfarb, and Shanno (BFGS) update (by Broyden et al. 1973) and is given by
$$H_+ = H_c + \frac{y_c y_c^T}{y_c^T s_c} - \frac{H_c s_c s_c^T H_c}{s_c^T H_c s_c} \tag{5.5}$$
The BFGS positive definite secant update has been the most successful update in a number of studies performed during the years. The positive definite secant update converges q-superlinearly [a proof can be found in Broyden et al. (1973)].

¹⁸The changes and the projections are executed using the Frobenius norm, $\|H\|_F = (\sum_{i,j} h_{ij}^2)^{1/2}$; the matrix is considered as a "long" vector.
¹⁹The solution exists if $y_c^T s_c > 0$, which is guaranteed if "accurate" line searches are performed (see Section 4.1).
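In code, one update of equation 5.5 is a few lines of linear algebra; the sketch assumes $y_c^T s_c > 0$ (see footnote 19), which keeps the new matrix symmetric and positive definite.

```python
import numpy as np

def bfgs_update(H, s, y):
    # Positive definite secant (BFGS) update of equation 5.5.
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)
```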
It is common to take the initial matrix $H_0$ as the identity matrix (first step in the direction of the negative gradient). It is possible to update directly the Cholesky factors (Goldfarb 1976), with a total complexity of $O(N^2)$ [see the implementation in Dennis and Schnabel (1983)]. Secant methods for learning in the multilayer perceptron have been used, for example, in Watrous (1987). The $O(N^2)$ complexity of BFGS is clearly a problem for very large networks, but the method can still remain competitive if the number of examples is very large, so that the computation of the error function dominates. A comparison of various nonlinear optimization strategies can be found in Webb et al. (1988). Second-order methods in continuous time are considered in Parker (1987).
6 Closing the Gap: Second-Order Methods with O(N) Complexity
One drawback of the BFGS update of equation 5.5 is that it requires storage for a matrix of size N × N and a number of calculations of order $O(N^2)$.²⁰ Although the available storage is less of a problem now than it was a decade ago [for a possible method to cope with limited storage, see, for example, Nocedal (1980)], the computational problem still exists when N becomes of the order of one hundred or more. Fortunately, it is possible to kill two birds with one stone. In Battiti (1989) it is shown that it is possible to use a secant approximation with O(N) computing time and storage that uses second-order information. This OSS (one-step secant) method does not require the choice of critical parameters, is guaranteed to converge to a point with zero gradient, and has been shown to accelerate the learning phase by many orders of magnitude with respect to batch BP if high precision in the output values is desired (Battiti and Masulli 1990). In cases where approximated output values are sufficient, the OSS method is usually better than or comparable with "fair" versions of backpropagation.²¹ While the term OSS should be preferred, historically OSS is a variation of what is called the one-step (memoryless) Broyden-Fletcher-Goldfarb-Shanno method. In addition to reducing both the space and computational complexity of the BFGS method to O(N), this method provides a strong link between secant methods and the conjugate gradient methods described in Section 3.

²⁰Updating the Cholesky factorization and calculating the solution are both of order $O(N^2)$. The same order is obtained if the inverse Hessian is updated, as in equation 6.1, and the search direction is calculated by a matrix-vector product.
²¹The comparison with BP with fixed learning and momentum rates has little meaning: if an improper learning rate is chosen, standard BP becomes arbitrarily slow or not convergent; if parameters are chosen with a slow trial-and-error process, this time should also be included in the total computing time.
The derivation starts by inverting equation 5.5, obtaining the positive definite secant update for the inverse Hessian:
$$H_+^{-1} = H_c^{-1} + \left(1 + \frac{y_c^T H_c^{-1} y_c}{y_c^T s_c}\right) \frac{s_c s_c^T}{y_c^T s_c} - \frac{s_c y_c^T H_c^{-1} + H_c^{-1} y_c s_c^T}{y_c^T s_c} \tag{6.1}$$
Now, there is an easy way to reduce the storage for the matrix $H_c^{-1}$: just forget the matrix and start each time from the identity I. Approximating equation 6.1 in this way, and multiplying by the gradient $g_c = \nabla E(w_c)$, the new search direction $p_+$ becomes
$$p_+ = -g_c + A_c\, s_c + B_c\, y_c \tag{6.2}$$
where the two scalars $A_c$ and $B_c$ are the following combinations of scalar products of the previously defined vectors $s_c$, $g_c$, and $y_c$ (last step, gradient, and difference of gradients):
$$A_c = -\left(1 + \frac{y_c^T y_c}{y_c^T s_c}\right) \frac{s_c^T g_c}{y_c^T s_c} + \frac{y_c^T g_c}{y_c^T s_c}, \qquad B_c = \frac{s_c^T g_c}{y_c^T s_c}$$
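Assuming the coefficients as reconstructed above, the O(N) character of the direction computation is evident in code: only scalar products of N-vectors appear.

```python
def oss_direction(g, s, y):
    # One-step secant direction of equation 6.2; g is the current
    # gradient, s the last step, y the last difference of gradients.
    sy = s @ y
    B = (s @ g) / sy
    A = -(1.0 + (y @ y) / sy) * B + (y @ g) / sy
    return -g + A * s + B * y
```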
The search direction at the beginning of learning is taken as the negative gradient, and it is useful to restart the search direction to $-g_c$ every N steps (N being the number of weights in the network). It is easy to check that equation 6.2 requires only O(N) operations for calculating the scalar products $s_c^T g_c$, $s_c^T y_c$, $y_c^T g_c$, and $y_c^T y_c$. Remembering that the search direction has to be used for a fast one-dimensional search (see Section 4.1), the total computation per cycle (per epoch) required by the method is a small multiple (2-4) of that required by one cycle of gradient descent with a fixed learning rate. Now, if exact line searches are performed, equation 6.2 produces mutually conjugate directions (Shanno 1978). The difference with other forms of the conjugate gradient method is that the one-step positive definite secant update maintains the "safety" properties even when the search is executed in a small number of one-dimensional trials.²² While the above method is suitable for batch learning, a proposal for a learning method usable in the on-line procedure has been presented in LeCun (1989) and Becker and LeCun (1989). The Hessian is approximated with its diagonal part, so that the matrix multiplication of the gradient by the inverse Hessian (see Newton's method) is approximated by dividing each gradient component $g_n$ by a running estimate $\tilde{h}_{nn}$ of the diagonal element $h_{nn}$.
²²If the problem is badly scaled, for example, if the typical magnitude of the variables changes a lot, it is useful to substitute the identity matrix with $H_0 = \max\{|E(w_0)|,\ \text{typical size of } E\} \cdot D_s^2$, where $D_s$ is a diagonal scaling matrix, such that the new variables $\hat{w} = D_s w$ are in the same range [see Dennis and Schnabel (1983)].
At each iteration a particular weight $w_n$ is updated according to the following rule²³:
$$\Delta w_n = -\frac{\epsilon}{\tilde{h}_{nn} + \mu}\, g_n \tag{6.3}$$
The estimate $\tilde{h}_{nn}$ of the diagonal component of the Hessian is in turn obtained with an exponentially weighted average of the second derivative (or an estimate thereof, $\partial^2 E/\partial w_n^2$), as follows:
$$\tilde{h}_{nn}(t+1) = (1 - \gamma)\, \tilde{h}_{nn}(t) + \gamma\, \frac{\partial^2 E}{\partial w_n^2} \tag{6.4}$$
Suppose that the weight $w_n$ connects the output of unit j to unit i ($w_n = w_{ij}$ in the double-index notation), $a_i$ is the total input to unit i, $f(\cdot)$ is the "squashing" function, and $x_j$ is the state of unit j. It is easy to derive
$$\frac{\partial^2 E}{\partial w_{ij}^2} = \frac{\partial^2 E}{\partial a_i^2}\, x_j^2 \tag{6.5}$$
The term $\partial^2 E/\partial a_i^2$ is then computed explicitly with a "backpropagation-type" procedure, as follows:
$$\frac{\partial^2 E}{\partial a_i^2} = f'(a_i)^2 \sum_k w_{ki}^2\, \frac{\partial^2 E}{\partial a_k^2} - f''(a_i)\, \frac{\partial E}{\partial x_i} \tag{6.6}$$
Finally, for the simulations in LeCun (1989), the term in equation 6.6 with the second derivative of the squashing function is neglected, as in the Levenberg-Marquardt method that will be described in Section 7, obtaining
$$\frac{\partial^2 E}{\partial a_i^2} = f'(a_i)^2 \sum_k w_{ki}^2\, \frac{\partial^2 E}{\partial a_k^2} \tag{6.7}$$
Note that a positive estimate is obtained in this way (so that the negative gradient is multiplied by a positive-definite diagonal matrix). The parameters $\mu$ and $\epsilon$ in equation 6.3 and $\gamma$ in equation 6.4 are fixed and must be appropriately chosen by the user. The purpose of adding $\mu$ to the diagonal approximation is explained by analogy with the trust-region method (see equation 4.9). According to Becker and LeCun (1989) the method does not bring a tremendous speed-up, but converges reliably without requiring extensive parameter adjustments.

²³In this rule we omit details related to weight sharing, that is, having more connections controlled by a single parameter.
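A sketch of one on-line step of equations 6.3-6.4; the per-weight arrays and the externally supplied curvature estimate of equation 6.7 are assumptions of this illustration.

```python
def diagonal_newton_update(w, g, h_tilde, curvature, eps, mu, gamma):
    # curvature holds the current estimates of d^2E/dw_n^2
    # (equation 6.5 with the approximation of equation 6.7).
    h_tilde = (1.0 - gamma) * h_tilde + gamma * curvature   # equation 6.4
    w = w - (eps / (h_tilde + mu)) * g                      # equation 6.3
    return w, h_tilde
```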
7 Special Methods for Least Squares
If the error function that is to be minimized is the usual $E = \frac{1}{2} \sum_{p=1}^{P} \sum_{i=1}^{n_o} (o_{pi} - t_{pi})^2$, learning a set of examples consists in solving a nonlinear least-squares problem, for which special methods have been devised. Two of these methods are now described: the first (the Gauss-Newton method) is based on simplifying the computation of second derivatives; the second (the Levenberg-Marquardt method) is a trust-region modification of the former. Let's define as $R(w)$ the vector²⁴ whose components are the residuals for the different patterns in the training set and output units [$r_{pi}(w) = o_{pi}(w) - t_{pi}$], so that the error can be expressed as $E = \frac{1}{2} R(w)^T R(w)$. It is straightforward to see that the first and second derivatives of $E(w)$ are, respectively,
$$\nabla E(w) = J(w)^T R(w) \tag{7.1}$$
$$\nabla^2 E(w) = J(w)^T J(w) + S(w) \tag{7.2}$$
where $J(w)$ is the Jacobian matrix, $J(w)_{(pi),n} = \partial r_{pi}(w)/\partial w_n$, and $S(w)$ is the part of the Hessian containing the second derivatives of $r_{pi}(w)$, that is, $S(w) = \sum_{p,i} r_{pi}(w)\, \nabla^2 r_{pi}(w)$. The standard Newton iteration is the following:
$$w_+ = w_c - [J(w_c)^T J(w_c) + S(w_c)]^{-1}\, J(w_c)^T R(w_c) \tag{7.3}$$
The particular feature of the problem is that, while $J(w_c)$ is easily calculated, $S(w_c)$ is not. On the other hand, a secant approximation seems "wasteful" because part of the Hessian [the $J(w_c)^T J(w_c)$ part] is easily obtained from $J(w)$ and, in addition, the remaining part $S(w)$ is negligible for small values of the residuals. The Gauss-Newton method consists in neglecting the $S(w)$ part, so that a single iteration is
$$w_+ = w_c - [J(w_c)^T J(w_c)]^{-1}\, J(w_c)^T R(w_c) \tag{7.4}$$
It can be shown that this step is completely equivalent to minimizing the error obtained from using an affine model of $R(w)$ around $w_c$:
$$\hat{R}_c(w) = R(w_c) + J(w_c)(w - w_c) \tag{7.5}$$
$$w_+ = \arg\min_w \frac{1}{2}\, \hat{R}_c(w)^T \hat{R}_c(w) \tag{7.6}$$

²⁴The couples of indices (p, i) can be alphabetically ordered, for example, and mapped to a single index.
The QR factorization method can be used for the solution of equation 7.5 [see Dennis and Schnabel (1983)]. The method is locally q-quadratically convergent for small residuals. If $J(w_c)$ has full column rank, $J(w_c)^T J(w_c)$ is nonsingular, the Gauss-Newton step is a descent direction, and the method can be modified with line searches (damped Gauss-Newton method). Another modification is based on the trust-region idea (see Section 4.3) and is known as the Levenberg-Marquardt method. The step is defined as
$$w_+ = w_c - [J(w_c)^T J(w_c) + \mu I]^{-1}\, J(w_c)^T R(w_c) \tag{7.7}$$
This method can be used also if $J(w)$ does not have full column rank (this happens, for example, if the number of examples is less than the number of weights). It is useful to mention that the components of the Jacobian matrix $\partial r_{pi}(w)/\partial w_{ab}$ can be calculated by the usual "chain rule" for derivatives with a number of backpropagation passes equal to the number of output units. If weight $w_{ab}$ connects unit b to unit a (please note that now the usual two-index notation is adopted), one obtains
$$\frac{\partial r_{pi}(w)}{\partial w_{ab}} = \delta_{pia}\, x_{pb}$$
where the term $\delta_{pia}$ is defined as $\delta_{pia} = \partial r_{pi}(w)/\partial \text{net}_{pa}$ and $x_{pb}$ is the output of unit b. For the output layer, $\delta_{pia} = f'(\text{net}_{pa})$ if $i = a$, 0 otherwise. For the other layers, $\delta_{pia} = f'(\text{net}_{pa}) \sum_c w_{ca}\, \delta_{pic}$, summing over the units c of the next layer (the one closer to the output). Software for variants of the Levenberg-Marquardt method can be found in Press et al. (1988). Other versions are MINPACK and NL2SOL (standard nonlinear least-squares packages).²⁵ Additional techniques that are usable for large-scale least-squares problems are presented in Toint (1987). They are based on adaptive modeling of the objective function and have been used for problems with up to thousands of variables.²⁶ Additional references are Gawthrop and Sbarbaro (1990) and Kollias and Anastassiou (1989). In Kollias and Anastassiou (1989) the Levenberg-Marquardt technique is combined with the acceleration techniques described in Jacobs (1988) and Silva and Almeida (1990).

²⁵I owe this information to Prof. Christopher G. Atkeson. MINPACK is described in Moré et al. (1980) and NL2SOL in Dennis et al. (1981).
²⁶In reality the test problems presented in this paper have a special "partially separable" structure, so that their practical application to multilayer perceptrons with complete connectivity is still a subject of research.
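Given the Jacobian, the Levenberg-Marquardt step of equation 7.7 is a one-line solve, sketched below; for mu = 0 it reduces to the Gauss-Newton step of equation 7.4.

```python
import numpy as np

def lm_step(J, R, mu):
    # Solve (J^T J + mu*I) s = -J^T R for the step s.
    N = J.shape[1]
    return np.linalg.solve(J.T @ J + mu * np.eye(N), -(J.T @ R))
```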
8 Other Heuristic Strategies

Some learning methods have been introduced specifically for backpropagation that show promising performance on some test problems.
Because the standard algorithm involves selecting appropriate learning and momentum rates, it is convenient to consider ways to adapt these parameters during the search process. In this case the trial-and-error selection is avoided and, in addition, the possibility of tuning the parameters to the current properties of the "error surface" usually yields faster convergence with respect to using fixed coefficients. A heuristic method for modifying the learning rate is, for example, described in Lapedes and Farber (1986), Vogl et al. (1988), and Battiti (1989) (the bold driver (BD) method). The idea is to increase the learning rate exponentially if successive steps reduce the error, and decrease it rapidly if an "accident" is encountered (increase of the error), until a proper rate is found (if the gradient is significantly different from zero, letting the step go to zero will eventually decrease the error). After starting with a small learning rate,²⁷ its modifications are described by the evolution equation:
$$\epsilon(t) = \begin{cases} \rho\, \epsilon(t-1) & \text{if } E[w(t)] < E[w(t-1)] \\ \sigma^{\ell}\, \epsilon(t-1) & \text{if } E[w(t)] \ge E[w(t-1)] \text{ using } \epsilon(t-1) \end{cases} \tag{8.1}$$
where $\rho$ is close to one (say $\rho = 1.1$) in order to avoid frequent "accidents" (because the error computation is wasted in these cases), $\sigma$ is chosen to provide a rapid reduction (say $\sigma = 0.5$), and $\ell$ is the minimum integer such that the reduced rate $\sigma^{\ell}\, \epsilon(t-1)$ succeeds in diminishing the error. The performance of this "quick and dirty" version is close to, and usually better than, the one obtained by appropriately choosing a fixed learning rate in batch BP.

²⁷If the initial rate is too large, some iterations are wasted to reduce it until an appropriate rate is found.
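A compact sketch of the bold driver schedule of equation 8.1 (illustrative; the inner loop realizes the minimal integer $\ell$ by repeated reduction).

```python
def bold_driver_step(E, grad_E, w, eps, rho=1.1, sigma=0.5):
    # Accept the step and grow the rate after a success; shrink the
    # rate and retry after an "accident" (error increase).
    E_old = E(w)
    g = grad_E(w)
    while True:
        w_new = w - eps * g
        if E(w_new) < E_old:
            return w_new, rho * eps
        eps = sigma * eps
```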
Suggestions for adapting both the search direction and the step along this direction are presented in Chan and Fallside (1987) and Jacobs (1988). In Chan and Fallside (1987) the learning and momentum rates are adapted to the structure of the error surface, by considering the angle $\theta_k$ between the last step and the gradient direction and by avoiding "domination" of the weight update by the momentum term (in order to avoid ascent directions). The weight update is
$$\Delta w(t) = -\epsilon(t)\, g(t) + \alpha(t)\, \Delta w(t-1) \tag{8.2}$$
where the learning rate $\epsilon(t)$ and the momentum rate $\alpha(t)$ are adapted at each step as functions of the angle $\theta_k$ [see Chan and Fallside (1987) for the precise adaptation rules].
A comparison of different techniques is presented in Chan (1990). In Jacobs (1988) each individual weight has its own learning rate, which is modified in order to avoid oscillations. In the proposed "delta-bar-delta"
method, the learning rate modification is the following:
$$\Delta \epsilon(t) = \begin{cases} \kappa & \text{if } \bar{\delta}(t-1)\, \delta(t) > 0 \\ -\phi\, \epsilon(t-1) & \text{if } \bar{\delta}(t-1)\, \delta(t) < 0 \\ 0 & \text{otherwise} \end{cases} \tag{8.3}$$
where
$$\bar{\delta}(t) = (1 - \theta)\, \delta(t) + \theta\, \bar{\delta}(t-1)$$
The current partial derivative $\delta(t)$ with respect to a weight is compared with an exponentially weighted average $\bar{\delta}(t)$ of the previous derivatives. If the signs agree, the (individual) learning rate $\epsilon$ is increased by $\kappa$; if they disagree (a symptom of oscillations), it is decreased by a portion $\phi$ of its value. A similar acceleration technique has been presented in Silva and Almeida (1990). A heuristic technique (quickprop), "loosely based" on Newton's method, is presented in Fahlman (1988). In quickprop each weight is updated in order to reach the minimum of an independent quadratic model (parabola).
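The per-weight bookkeeping of equation 8.3 vectorizes directly; in this sketch every argument is an array with one entry per weight.

```python
import numpy as np

def delta_bar_delta(eps, delta_bar, delta, kappa, phi, theta):
    agree = delta_bar * delta > 0         # signs agree: grow linearly
    disagree = delta_bar * delta < 0      # signs disagree: shrink
    eps = eps + kappa * agree - phi * eps * disagree
    delta_bar = (1.0 - theta) * delta + theta * delta_bar
    return eps, delta_bar
```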
9 Summary and Conclusions

A review of first- and second-order methods suitable for learning has been presented. Modifications of first-order backpropagation and second-order methods are currently used by many researchers in different fields, both to obtain faster convergence times and to avoid the meta-optimization phase (where the user has to select proper parameters for learning, check the numerical stability, and, in general, optimize the performance of the learning method). This second point has a considerable impact in order to maintain, and possibly improve, the short development times required by neural networks with respect to methods based on intensive "knowledge engineering." In the previous sections some of these techniques have been referenced, with particular emphasis on their local and global convergence properties, numerical stability, and memory/computation requirements. Some second-order techniques require a large amount of computation per iteration²⁸ [of order $O(N^2)$ or $O(N^3)$] and/or large amounts of memory [of order $O(N^2)$]. Nonetheless, they are still applicable to problems with a limited number of weights (say < 100) and show superior performance with respect to standard backpropagation, especially if high precision in the input-output mapping function executed by the network is required. For problems with more weights, suitable approximations can be applied, obtaining methods with O(N) behavior, while still maintaining some "safety" and "progress" properties of their parent methods.

²⁸It is clear that, in order to obtain the total learning time, the estimates must be multiplied by the number of patterns P and the total number of iterations. In this paper these factors are omitted for clarity of exposition.
While the presentation has focused on the multilayer perceptron neural network, most of these techniques can be applied to alternative models.²⁹ It is also worth stressing that problems related to memory requirements are less stringent now than when these methods were invented, and problems related to massive computation can be approached by using concurrent computation. Most of the presented techniques are suitable for a parallel implementation, with a speed-up that is approximately proportional to the number of processors employed [see, for example, Kramer and Sangiovanni-Vicentelli (1988); Battiti et al. (1990); Battiti and Straforini (1991)]. We feel that the cross-fertilization between optimization techniques and neural networks is fruitful and deserves further research efforts. In particular the relevance of second-order techniques to large-scale backpropagation tasks (with thousands of weights and examples) is a subject that deserves additional studies and comparative experiments.

²⁹For example, the RBF model introduced in Broomhead and Lowe (1988) and Poggio and Girosi (1990).
Acknowledgments

The author is indebted to Profs. Geoffrey Fox, Roy Williams, and Edoardo Amaldi for helpful discussions. Thanks are due to Profs. Tommaso Poggio, Christopher Atkeson, and Michael Jordan for sharing their views on second-order methods. The results of a "second-order survey" by Eric A. Wan (Neuron Digest 1989, 6-53) were a useful source of references. The referee's detailed comments are also greatly appreciated. Part of this work was completed while the author was a research assistant at Caltech. The research group was supported in part by DOE Grant DE-FG03-85ER25009, the National Science Foundation with Grant IST-8700064, and by IBM.
References

Allred, L. G., and Kelly, G. E. 1990. Supervised learning techniques for backpropagation networks. Proc. Int. Joint Conf. Neural Networks (IJCNN), Washington I, 721-728.
Barnard, E., and Cole, R. 1988. A neural-net training program based on conjugate gradient optimization. Oregon Graduate Center, CSE 89-014.
Battiti, R. 1989. Accelerated back-propagation learning: Two optimization methods. Complex Syst. 3, 331-342.
Battiti, R., and Masulli, F. 1990. BFGS optimization for faster and automated supervised learning. Proc. Int. Neural Network Conf. (INNC 90), Paris, France 757-760.
Battiti, R., and Straforini, M. 1991. Parallel supervised learning with the memoryless quasi-Newton method. In Parallel Computing: Problems, Methods and Applications, P. Messina and A. Murli, eds. Elsevier, Amsterdam.
Battiti, R., Colla, A. M., Briano, L. M., Cecinati, R., and Guido, P. 1990. An application-oriented development environment for neural net models on the multiprocessor Emma-2. Proc. IFIP Workshop Silicon Architectures Neural Nets, St Paul de Vence, France, M. Sami and J. Calzadilla-Daguerre, eds. North-Holland, Amsterdam.
Becker, S., and LeCun, Y. 1989. Improving the convergence of backpropagation learning with second-order methods. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds. Morgan Kaufmann, San Mateo, CA.
Bengio, Y., and Moore, B. 1989. Acceleration of learning. Proc. GNCB-CNR School, Trento, Italy.
Bingham, J. A. C. 1988. The Theory and Practice of Modem Design. Wiley, New York.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Broyden, C. G., Dennis, J. E., and Moré, J. J. 1973. On the local and superlinear convergence of quasi-Newton methods. J.I.M.A. 12, 223-246.
Cardon, H., van Hoogstraten, R., and Davies, P. 1991. A neural network application in geology: Identification of genetic facies. Proc. Int. Conf. Artificial Neural Networks (ICANN-91), Espoo, Finland 1519-1522.
Chan, L. W. 1990. Efficacy of different learning algorithms of the backpropagation network. Proc. IEEE TENCON-90.
Chan, L. W., and Fallside, F. 1987. An adaptive training algorithm for back propagation networks. Comput. Speech Language 2, 205-218.
Dennis, J. E., and Schnabel, R. B. 1983. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice Hall, Englewood Cliffs, NJ.
Dennis, J. E., Gay, D. M., and Welsch, R. E. 1981. Algorithm 573 NL2SOL: An adaptive nonlinear least-squares algorithm [E4]. TOMS 7, 369-383.
Dongarra, J. J., Moler, C. B., Bunch, J. R., and Stewart, G. W. 1979. LINPACK Users' Guide. SIAM, Philadelphia.
Drago, G. P., and Ridella, S. 1991. An optimum weights initialization for improving scaling relationships in BP learning. Proc. Int. Conf. Artificial Neural Networks (ICANN-91), Espoo, Finland 1519-1522.
Fahlman, S. E. 1988. An empirical study of learning speed in back-propagation networks. Preprint CMU-CS-88-162, Carnegie Mellon University, Pittsburgh, PA.
Gawthrop, P., and Sbarbaro, D. 1990. Stochastic approximation and multilayer perceptrons: The gain backpropagation algorithm. Complex Syst. 4, 51-74.
Gill, P. E., Murray, W., and Wright, M. H. 1981. Practical Optimization. Academic Press, London.
Glover, F. 1987. TABU search methods in artificial intelligence and operations research. ORSA Art. Int. Newslett. 1(2), 6.
Goldfarb, D. 1976. Factorized variable metric methods for unconstrained optimization. Math. Comp. 30, 796-811.
Goldstein, A. A. 1967. Constructive Real Analysis. Harper & Row, New York.
Jacobs, R. A. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1, 295-307.
Johansson, E. M., Dowla, F. U., and Goodman, D. M. 1990. Backpropagation learning for multi-layer feed-forward neural networks using the conjugate gradient method. Lawrence Livermore National Laboratory, Preprint UCRL-JC-104850.
Kollias, S., and Anastassiou, D. 1989. An adaptive least squares algorithm for the efficient training of multilayered networks. IEEE Trans. CAS 36, 1092-1101.
Kramer, A. H., and Sangiovanni-Vicentelli, A. 1988. Efficient parallel learning algorithms for neural networks. In Advances in Neural Information Processing Systems, Vol. 1, pp. 75-89. Morgan Kaufmann, San Mateo, CA.
Lapedes, A., and Farber, R. 1986. A self-optimizing, nonsymmetrical neural net for content addressable memory and pattern recognition. Physica 22D, 247-259.
LeCun, Y. 1986. HLM: A multilayer learning network. Proc. 1986 Connectionist Models Summer School, Pittsburgh 169-177.
LeCun, Y. 1989. Generalization and network design strategies. In Connectionism in Perspective, pp. 143-155. North-Holland, Amsterdam.
LeCun, Y., Kanter, I., and Solla, S. A. 1991. Second order properties of error surfaces: Learning time and generalization. In Neural Information Processing Systems - NIPS, Vol. 3, pp. 918-924. Morgan Kaufmann, San Mateo, CA.
Luenberger, D. G. 1973. Introduction to Linear and Nonlinear Programming. Addison-Wesley, New York.
Luo, Zhi-Quan. 1991. On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks. Neural Comp. 3, 227-245.
Malferrari, L., Serra, R., and Valastro, G. 1990. Using neural networks for signal analysis in oil well drilling. Proc. III Ital. Workshop Parallel Architect. Neural Networks, Vietri s/m Salerno, Italy 345-353. World Scientific, Singapore.
Møller, M. F. 1990. A scaled conjugate gradient algorithm for fast supervised learning. PB-339 Preprint, Computer Science Department, University of Aarhus, Denmark. Neural Networks, to be published.
Moré, J. J., Garbow, B. S., and Hillstrom, K. E. 1980. User guide for MINPACK-1. Argonne National Labs Report ANL-80-74.
Nocedal, J. 1980. Updating quasi-Newton matrices with limited storage. Math. Comp. 35, 773-782.
Parker, D. B. 1987. Optimal algorithms for adaptive networks: Second-order back propagation, second-order direct propagation, and second-order Hebbian learning. Proc. ICNN-1, San Diego, CA II-593-II-600.
Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978-982.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1988. Numerical Recipes in C. Cambridge University Press, Cambridge.
Rumelhart, D. E., and McClelland, J. L. (eds.) 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press, Cambridge, MA.
Sejnowski, T. J., and Rosenberg, C. R. 1986. NETtalk: A parallel network that learns to read aloud. The Johns Hopkins University EE and CS Tech. Rep. JHU/EECS-86/01.
Shanno, D. F. 1978. Conjugate gradient methods with inexact searches. Math. Oper. Res. 3(3), 244-256.
Silva, F., and Almeida, L. 1990. Acceleration techniques for the backpropagation algorithm. Lecture Notes in Computer Science, Vol. 412, pp. 110-119. Springer-Verlag, Berlin.
Toint, L. 1987. On large scale nonlinear least squares calculations. SIAM J. Sci. Stat. Comput. 8(3), 416-435.
Vincent, J. M. 1991. Facial feature location in coarse resolution images by multilayered perceptrons. Proc. Int. Conf. Artificial Neural Networks (ICANN-91), Espoo, Finland 821-826.
Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., and Alkon, D. L. 1988. Accelerating the convergence of the back-propagation method. Biol. Cybernet. 59, 257-263.
Watrous, R. 1987. Learning algorithms for connectionist networks: Applied gradient methods of nonlinear optimization. Tech. Rep. MS-CIS-87-51, University of Pennsylvania.
Webb, A. R., Lowe, D., and Bedworth, M. D. 1988. A comparison of nonlinear optimization strategies for adaptive feed-forward layered networks. RSRE MEMO 4157, Royal Signals and Radar Establishment, Malvern, England.
White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Comp. 1, 425-464.
Widrow, B., and Stearns, S. D. 1985. Adaptive Signal Processing. Prentice Hall, Englewood Cliffs, NJ.
Williams, P. M. 1991. A Marquardt algorithm for choosing the step-size in backpropagation learning with conjugate gradients. Preprint, School of Cognitive and Computing Sciences, University of Sussex (13 February 1991).
Received 28 December 1990; accepted 13 September 1991.
ARTICLE
Communicated by Ramamohan Paturi
Efficient Simplex-Like Methods for Equilibria of Nonsymmetric Analog Networks

Douglas A. Miller
Steven W. Zucker
Computer Vision and Robotics Laboratory, Research Centre for Intelligent Machines, McGill University, 3480 University Street, R. 410, Montréal, Canada H3A 2A7

Neural Computation 4, 167-190 (1992)
What is the complexity of computing equilibria for physically implementable analog networks (Hopfield 1984; Sejnowski 1981) with arbitrary connectivity? We show that if the amplifiers are piecewise linear, then such networks are instances of a game-theoretic model known as polymatrix games. In contrast with the usual gradient descent methods for symmetric networks, equilibria for polymatrix games may be computed by vertex pivoting algorithms similar to the simplex method for linear programming. Like the simplex method, these algorithms have characteristic low order polynomial behavior in virtually all practical cases, though not certain theoretical ones. While these algorithms cannot be applied to models requiring evolution from an initial point, they are applicable to "clamping" models whose input is expressed purely as a bias. Thus we have an a priori indication that such models are computationally tractable.
1 Introduction

A fundamental question is: Do biological or other physical systems solve problems that are NP-hard for Turing machines? Hopfield (1984) and Hopfield and Tank (1985) have provided evidence in the negative, suggesting that to the extent real or artificial neural systems appear to solve NP-hard problems (e.g., the traveling salesman problem), this is only illusory. Hopfield has taken the position that biological computation amounts to designing analog networks with appropriate asymptotically stable equilibria. What is really being solved are not NP-hard problems, but only much easier approximations, which amount to finding these equilibria. Thus, Hopfield seems implicitly to accept the Strong Church's Thesis of Vergis et al. (1986), which implies (accepting P ≠ NP) that no analog computer (dynamical system) can solve NP-hard problems with less than exponential resources. (See Pour-El and Richards 1981 for an alternative though not contradictory view.) A similar point of view has also been taken by Hummel and Zucker (1983) and Zucker et al. (1989) with regard to the solving of problems in
vision. Here there are instances, such as in the interpretation of line drawings (Kirousis and Papadimitriou 1988), where one might be tempted to think the brain is solving NP-hard problems, whereas what seems much more likely is that the brain finds only quick approximations. However, even accepting this "stable state" view of biological computation, the question remains: How do we know that finding an asymptotically stable equilibrium of a dynamical system is computationally an easy problem in the Turing sense (e.g., polynomial)? This question seems especially important for nonsymmetric networks, where there is no descent function guaranteeing convergence. However, even in the symmetric case it is known to be NP-hard just to decide if a given point is a local minimum for a constrained quadratic! (Murty 1988, p. 170; Vergis et al. 1986). In this paper we offer a partial answer to this question. We show that computing an equilibrium (not necessarily stable) for a Hopfield network with piecewise-linear amplifiers and arbitrary connectivity may be done with a type of vertex pivoting algorithm, Lemke's algorithm, which is very similar in its complexity to the simplex method for linear programming. The latter is strongly polynomial in practice although not necessarily so in theory. (An analysis of this phenomenon based on probability theory has become an outstanding mathematical question in recent decades, which has been partly answered - e.g., Adler et al. 1984.) Pivoting methods such as Lemke's algorithm would appear to be the only known algorithms for finding equilibria of nonsymmetric networks that are both guaranteed to work and have, at least in a probabilistic sense, polynomial complexity. On the other hand Lemke's algorithm has two characteristics that sharply distinguish it from more traditional techniques of following integral curves through vector (usually gradient) fields. First, it has no sensitivity to initial conditions. Second, there is no guarantee it will produce a stable equilibrium. In many respects the algorithm could be expected to behave like a procedure that quickly picked an equilibrium at random. These characteristics suggest a different approach to continuous dynamic models than has been taken so far in such applications as Hopfield and Tank's traveling salesman network and the vision relaxation network of Zucker et al. In these applications there are at all times an extremely large number of equilibria, and the evolution of the system is determined by the initial state. The supposition is that this initial state will evolve to a stable final state in its basin of attraction. There are, however, several possible problems with this kind of computation. First, there is no guarantee, at least for a nonsymmetric network, that an attractive basin need exist (cf. Appendix A). Second, there is no guarantee that convergence, if it does occur, will be rapid, especially in numerical implementations. A well-known example of this kind of problem is zigzagging behavior for steepest descent methods (Luenberger 1973). Third, an initial position
may be unstable in an especially bad way, by lying near the boundaries of a large number of different attractive basins, and thus requiring impractically large precision for a useful dynamic simulation. This appears to have been the case in the Wilson and Pawley (1988) simulations of the Hopfield and Tank traveling salesman network. An alternative to this "initial position" view of computation is instead to express an input vector as a bias on some subset of the processing units, and then use Lemke's algorithm. If the bias is sufficiently large we get the kind of "clamping" described by Hinton and Sejnowski (1986) in the context of Boltzmann machines. The effect, for an appropriately designed or trained network, would be to eliminate the great mass of equilibria that exist in the unbiased state, leaving the system ideally with just one equilibrium state. The fact that we could use Lemke's algorithm would a priori indicate that the model was computationally tractable. Furthermore this kind of network computation would seem more consistent with the capabilities of low precision processing elements such as neurons, where it would appear difficult to specify accurately an initial position, or a consistent evolutionary path. This bias/clamp approach raises the question, how much bias is necessary to constitute a clamp? Put another way, if c is a vector, $\delta$ a nonnegative scalar, and $\delta c$ the bias, what is the minimum value of $\delta$ necessary for a subset of the processing units to be clamped into a given state? Indeed we may then ask how this minimum value would change with respect to changes in network connectivities resulting, say, from learning. These questions are similar in many respects to those that are efficiently handled in linear programming using parametric sensitivity analysis based on the simplex algorithm (Dantzig 1963). Our preliminary results suggest that Lemke's algorithm could provide the basis for a related kind of analysis for the networks considered here. Of course, as a general procedure for computing equilibria, one may alternate between Lemke's algorithm and following integral curves, whichever is more appropriate. This approach would be analogous to the current situation in linear programming, where the simplex method provides an indispensable theoretical framework that may be supplemented by interior point methods such as Karmarkar's algorithm (Murty 1988). We shall not concern ourselves directly with the earlier Hopfield (1982) model where the network is symmetric and the processors are binary valued. Hopfield computes equilibria for these networks with simple discrete descent methods not generally applicable to the later Hopfield (1984) model or to ours. Hopfield's (1982) problem fits into a very interesting class of polynomial-time local search (PLS) problems (Johnson et al. 1988) and in fact is known to be polynomially complete for this class (Papadimitriou et al. 1990). If we change this discrete problem by allowing the processors to assume a bounded range of real values, then it becomes one of those which we consider. However this continuous problem is easier since its solution set is always at least as large or
larger (cf. final example, Appendix A). The PLS-hardness results of Papadimitriou et al. would therefore not appear to apply to the kinds of continuous problems which we consider. We discuss the complexity of Lemke's algorithm further in Section 6.

2 Overview of Paper
This paper describes a correspondence between analog artificial neural networks similar to those described by Hopfield (1984) and the branch of mathematics known as game theory. An immediate result of this correspondence is the existence of a complementary vertex pivoting algorithm, known as Lemke's algorithm, similar in many respects to the well-known simplex algorithm for linear programming, which will compute an equilibrium for any instance of such an artificial neural network, regardless of interconnectivity, and in particular for cases where the interconnectivity is nonsymmetric. Such cases are in general avoided by Hopfield, yet are clearly of great interest for applications such as early visual processes (Zucker et al. 1989). We will not be concerned with game theory as a whole, which originated with von Neumann and Morgenstern (1944), and for which the literature is now immense, but rather with a special branch known as noncooperative n-person games, and indeed a special branch of these known as polymatrix games. To give an overview of this paper, in Section 3 we introduce the theory of polymatrix n-person games and noncooperative equilibrium strategies. In Section 4 we describe a class of analog networks similar to those described in Hopfield (1984), the differences being (1) that our amplifiers are linear over their specified operating range, rather than asymptotic sigmoid over the real line, (2) that we make specific assumptions of lower and upper bounds on the voltages that our amplifier inputs can attain, and (3) that the interconnectivity ("synapses") of our amplifiers need not be symmetric. The first two modifications allow us to view these networks as polymatrix n-person games, each player representing a neuron, and each player/neuron having the two game strategies "depolarize" and "hyperpolarize." The third modification is a bonus, since polymatrix games do not require symmetry. In Section 5 we show that the bounded linear voltage amplifiers of the previous section may be replaced with bounded piecewise linear voltage amplifiers, staying within the same model, by simply adding more linear amplifiers, approximately one per linear segment per amplifier. We argue that only a small number of linear segments may be necessary for biologically plausible response curves. In Section 6 we discuss the complexity of computing equilibria for polymatrix games and hence for our analog networks. We state as a proposition the main result of the paper, that Lemke's algorithm will
compute an equilibrium for any bounded piecewise-linear Hopfield network. While purely analytic results on the probabilistic efficiency of Lemke's algorithm (Todd 1983) are only suggestive, Lemke's algorithm is known in practice to be of low order polynomial complexity, its computational complexity being essentially that of the well-known simplex method for linear programming (Murty 1988). Thus the applicability of Lemke's algorithm to computing equilibria for nonsymmetric Hopfield networks strongly suggests such computations are tractable, and in particular, not NP-complete (Garey and Johnson 1979). This in turn reinforces our belief that such models may be able to capture important properties of biological computation. Indeed, as we have noted (Miller and Zucker 1991), polymatrix games with zero self-payoff terms are equivalent to relaxation labeling (Hummel and Zucker 1983), which has already been applied to modeling early visual systems (e.g., Zucker et al. 1989). In Appendices A and B we compare two quite different methods for computing equilibria, which we refer to as primal and dual. The primal method (Appendix A) amounts to following the integral curves of the dynamical system defining the network. The dual method (Appendix B) is Lemke's algorithm, which we describe here in detail.

3 Polymatrix Games
An n-person game (Nash 1951) is a set of n players, each with a set of m pure strategies. Player i has a real-valued payoff function $s_i(\lambda_1, \ldots, \lambda_n)$ of the pure strategies $\lambda_1, \ldots, \lambda_n$ chosen by the n players. For each player there is an additional kind of strategy called mixed, which is a probability distribution on the player's m pure strategies. A player i's payoff for a mixed strategy is the expected value of i's pure strategy payoff given all players choose according to their mixed strategies. Notice that as with pure strategy payoffs, mixed strategy payoffs are only meaningful in terms of all players' simultaneous actions. A noncooperative or Nash equilibrium is a collection of mixed strategies for each player such that no player can receive a larger expected payoff by changing his/her mixed strategy given the other players stick to their mixed strategies. Nash showed using the Brouwer fixed point theorem that such equilibria always exist. However they need not be stable, as we shall discuss in Appendix A. A polymatrix game is an n-person game in which each payoff to each player i in pure strategy is of the form

$$s_i(\lambda_1, \ldots, \lambda_n) = \sum_{j} r_{ij}(\lambda_i, \lambda_j)$$
where for all i, $r_{ii}(\lambda_i, \tilde\lambda_i) = 0$. (Here we use $\tilde\lambda_i$ to denote a strategy for i that is possibly different from $\lambda_i$.) We may interpret $r_{ij}(\lambda_i, \lambda_j)$ as i's payoff from j given their respective pure strategies $\lambda_i$ and $\lambda_j$. This implies a
payoff to i in mixed strategies of the form

$$\sum_{j} \sum_{\lambda_i, \lambda_j} p_i(\lambda_i)\, r_{ij}(\lambda_i, \lambda_j)\, p_j(\lambda_j) \eqno(3.1)$$

Allowing in addition a fixed payoff $c_i(\lambda_i)$ to player i for playing pure strategy $\lambda_i$, player i's mixed-strategy payoff becomes

$$\sum_{j} \sum_{\lambda_i, \lambda_j} p_i(\lambda_i)\, r_{ij}(\lambda_i, \lambda_j)\, p_j(\lambda_j) + \sum_{\lambda_i} p_i(\lambda_i)\, c_i(\lambda_i) \eqno(3.2)$$

In terms of the mn-vector p of all players' mixed strategies we may define the block matrix R and block vector c,

$$R = \begin{bmatrix} [r_{11}(\lambda_1, \lambda_1)] & \cdots & [r_{1n}(\lambda_1, \lambda_n)] \\ \vdots & & \vdots \\ [r_{n1}(\lambda_n, \lambda_1)] & \cdots & [r_{nn}(\lambda_n, \lambda_n)] \end{bmatrix}, \qquad c = \begin{bmatrix} [c_1(\lambda_1)] \\ \vdots \\ [c_n(\lambda_n)] \end{bmatrix} \eqno(3.3)$$
If we let A be the $n \times mn$ matrix whose ith row has a $-1$ in each of the m positions corresponding to player i's strategies and a 0 elsewhere, and let $q^T$ be the n-vector $(-1, \ldots, -1)$, then p is a vector of all players' mixed strategies if and only if

$$Ap = q, \qquad p \ge 0 \eqno(3.4)$$
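For concreteness, the block objects of 3.3 and 3.4 can be assembled directly. The following is a minimal sketch (ours, not from the original text; the payoff entries are arbitrary illustrative values) for a two-player, two-strategy polymatrix game:

```python
import numpy as np

n, m = 2, 2   # two players, two pure strategies each

# Block game matrix R of 3.3: off-diagonal blocks [r_ij(lambda_i, lambda_j)];
# the entries below are arbitrary illustrative payoffs.
R = np.zeros((n * m, n * m))
R[0:2, 2:4] = np.array([[1.0, 0.0],
                        [0.0, 1.0]])    # player 1's payoffs from player 2
R[2:4, 0:2] = np.array([[-1.0, 0.0],
                        [0.0, -1.0]])   # player 2's payoffs from player 1
c = np.zeros(n * m)                     # constant payoff terms c_i(lambda_i)

# Constraint matrix A of 3.4: row i has -1 on player i's block, so that
# Ap = q with q = (-1, ..., -1) forces each p_i to sum to one.
A = np.zeros((n, n * m))
A[0, 0:2] = -1.0
A[1, 2:4] = -1.0
q = -np.ones(n)

p = np.full(n * m, 0.5)                 # a candidate mixed-strategy vector
assert np.allclose(A @ p, q) and (p >= 0).all()   # p satisfies 3.4
print("payoff field Rp + c =", R @ p + c)
```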
Also in terms of this notation we may express the gradient of 3.2 with respect to $p_i$ as

$$(Rp + c)_i \eqno(3.5)$$
Assume p satisfies 3.4 and is fixed except for player i, whose payoff is given by 3.2. Since this function is concave, and the constraint set (a simplex) is convex, a given mixed strategy for i will have a maximum payoff if and only if i's gradient 3.5 has a vanishing projection onto the constraint set 3.4. Equivalently, $p_i$ is an optimal strategy for player i if and only if there exists no directional vector d such that $d_j = 0$ for $j \ne i$, and

$$d^T(Rp + c) > 0, \qquad A_{i\cdot}\, d = 0, \qquad p_i(\lambda_i) = 0 \text{ implies } d_i(\lambda_i) \ge 0 \text{ for all } \lambda_i \eqno(3.6)$$
Now let all players be free to change their strategies. For p to be a Nash equilibrium, 3.6 must be simultaneously nonsatisfiable for each player i. It can be shown (e.g., Miller and Zucker 1991) that this set of simultaneous conditions is equivalent to there being no d satisfying the system

$$\begin{aligned} &d^T(Rp + c) > 0 \\ &Ad = 0 \\ &\text{for all } i, \lambda_i,\ p_i(\lambda_i) = 0 \text{ implies } d_i(\lambda_i) \ge 0 \\ &Ap = q, \quad p \ge 0 \end{aligned} \eqno(3.7)$$
In view of 3.7 we now have an alternative characterization of the Nash equilibria of the polymatrix game 3.3 in terms of the equilibria of the dynamical system

$$p' = Rp + c, \qquad Ap = q, \quad p \ge 0 \eqno(3.8)$$
In other words, these equilibria are precisely the points at which the vector field of 3.8 vanishes. Notice if R is symmetric then $p'$ is the gradient of

$$\frac{1}{2}\, p^T R\, p + c^T p \eqno(3.9)$$

The first term in 3.9 corresponds to the average local potential in relaxation labeling (Hummel and Zucker 1983).
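At an interior point the equilibrium test implied by 3.5-3.8 is simple: the projection of the gradient onto each player's simplex vanishes exactly when the field components within each block are equal. A minimal sketch of such a check (ours; blocks lists each player's strategy indices):

```python
import numpy as np

def is_interior_equilibrium(R, c, p, blocks, tol=1e-9):
    """Equilibrium test for 3.8 at a strictly positive p: the projected
    field vanishes iff, within each player's block of strategies, all
    components of Rp + c are equal (the gradient 3.5 is then normal to
    the simplex, so no feasible direction d can satisfy 3.6)."""
    v = R @ p + c
    return all(np.ptp(v[idx]) < tol for idx in blocks)

# e.g., for the two-player example above:
# is_interior_equilibrium(R, c, p, [np.arange(0, 2), np.arange(2, 4)])
```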
4 Analog Networks as Polymatrix Games
We take as our point of departure the class of analog networks defined by Hopfield (1984). These are dynamical systems defined by the equations

$$C_i \frac{du_i}{dt} = \sum_{j \ne i} T_{ij} V_j - \frac{u_i}{R_i} + I_i, \qquad u_i = g_i^{-1}(V_i) \eqno(4.1)$$

for $i = 1, \ldots, n$. Here $u_i$ and $V_i$ are interpreted as the input and output voltages of an instantaneous amplifier described by a continuous monotonic sigmoid function $g_i(u_i)$. In addition we define $|T_{ij}|$ as the conductance between the output of amplifier j and the input of amplifier i, we let $C_i$ be the input capacitance of i, we let $I_i$ be a fixed input bias current for i, and we define $R_i$ by

$$\frac{1}{R_i} = \frac{1}{\rho_i} + \sum_{j} |T_{ij}| \eqno(4.2)$$

where $\rho_i$ is the resistance across $C_i$. If $T_{ij}$ is negative, then the input to amplifier i comes from an inverting amplifier $-g_j(u_j)$. Such a network is illustrated in Figure 1. Suppose now $g_i$ is a linear function on the real interval $[\alpha_i, \beta_i]$, $\alpha_i < 0 < \beta_i$, such that $g_i(\alpha_i) = 0$, $g_i(\beta_i) = 1$, and that $\alpha_i$, $\beta_i$ are also lower and upper bounds on the voltage that the input capacitor to amplifier i can attain. Thus a further input current to a saturated capacitor would produce no effect. (We shall show in the next section that this model actually includes piecewise-linear voltage amplifiers as well.) Letting $\delta_i = (\beta_i - \alpha_i)$ be a scaling constant, these assumptions give us a new version of 4.1:

$$C_i \frac{du_i}{dt} = \sum_{j \ne i} T_{ij} V_j - \frac{u_i}{R_i} + I_i, \qquad u_i = \alpha_i + \delta_i V_i, \quad \alpha_i \le u_i \le \beta_i \eqno(4.3)$$
Rewriting this in terms of the output voltages $V_i$ and dividing through by $\delta_i C_i$ we obtain

$$\frac{dV_i}{dt} = \sum_{j \ne i} \frac{T_{ij}}{\delta_i C_i} V_j - \frac{V_i}{R_i C_i} + \frac{-\alpha_i/R_i + I_i}{\delta_i C_i}, \qquad 0 \le V_i \le 1 \eqno(4.4)$$
Notice the amplifier gain $1/\delta_i$ is inversely related to the influence of the capacitance term $-V_i/R_i C_i$.
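As an illustration of 4.4 (our sketch; all parameter values are arbitrary), a projected Euler integration that clips each $V_i$ to $[0, 1]$, mimicking the saturating input capacitors:

```python
import numpy as np

# A sketch of the bounded-linear network dynamics 4.4 (illustrative values).
rng = np.random.default_rng(0)
n = 4
T = rng.normal(size=(n, n)); np.fill_diagonal(T, 0.0)   # need not be symmetric
C = np.ones(n); rho = np.ones(n)
R = 1.0 / (1.0 / rho + np.abs(T).sum(axis=1))            # as in 4.2
alpha, beta = -np.ones(n), np.ones(n)
delta = beta - alpha
I = np.zeros(n)

V = rng.uniform(0, 1, size=n)
dt = 1e-3
for _ in range(10_000):
    dV = (T @ V) / (delta * C) - V / (R * C) + (-alpha / R + I) / (delta * C)
    V = np.clip(V + dt * dV, 0.0, 1.0)   # saturation: 0 <= V_i <= 1
print("final output voltages:", V)
```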
Figure 1: A two-node analog network. Each node or "neuron" i includes a noninverting and inverting voltage amplifier and a capacitor $C_i$ with a parallel "membrane" resistor. Each other node j may connect ("synapse") onto i via a resistance $1/|T_{ij}|$ from j's noninverting or inverting amplifier, respectively, depending on whether $T_{ij}$ is positive or negative. Node i may also have a constant input current $I_i$. In this example $T_{1,2}$ and $T_{2,1}$ are negative, so that the two nodes are mutually suppressing. (Adapted from Hopfield and Tank 1985.)
If the $T_{ij}$ are symmetric, Hopfield (1984, p. 3090) gives a function for the dynamical system (4.1) of the form

$$-\frac{1}{2} \sum_i \sum_{j \ne i} T_{ij} V_i V_j + \sum_i \frac{1}{R_i} \int_0^{V_i} g_i^{-1}(V)\, dV - \sum_i I_i V_i$$

which strictly decreases with the time evolution of the system unless an equilibrium is reached, thus showing the system cannot cycle. Defining $g_i^{-1}$ as in 4.3 and dividing through by $\delta_i C_i$ gives us

$$-\frac{1}{2} \sum_i \sum_{j \ne i} T_{ij} V_i V_j / \delta_i C_i + \frac{1}{2} \sum_i V_i^2 / R_i C_i - \sum_i \left( -\alpha_i/R_i + I_i \right) V_i / \delta_i C_i \eqno(4.5)$$
which is actually a potential function for 4.4, that is, its negative gradient projected onto the constraint set of 4.4 is the vector field of 4.4. In fact, 4.5 is really just an instance of 3.9, for at this point it is trivial to show 4.4 equivalent to a dynamical system of the form 3.8, that is,
to a polymatrix game. The idea is, first, to associate amplifiers with players, and then, for each i, to let player i have exactly two strategies d ("depolarize") and h ("hyperpolarize"), and to associate $p_i(d)$ with $V_i$. To do this, let

$$r_{ij}(d, d) = 2T_{ij}/\delta_i C_i \ \ (j \ne i), \qquad r_{ii}(d, d) = -2/R_i C_i, \qquad c_i(d) = 2(-\alpha_i/R_i + I_i)/\delta_i C_i \eqno(4.6)$$

and all other entries of R and c be zero. Using the first line of 3.8 we find

$$\begin{bmatrix} \pi_i(d)' \\ \pi_i(h)' \end{bmatrix} = \begin{bmatrix} 2\left( \sum_{j \ne i} (T_{ij}/\delta_i C_i)\, p_j(d) - (1/R_i C_i)\, p_i(d) + (-\alpha_i/R_i + I_i)/\delta_i C_i \right) \\ 0 \end{bmatrix} \eqno(4.7)$$

We may then compute the magnitude $\tau_i(d)'$ of the projection of 4.7 onto the subspace $p_i(d) + p_i(h) = 1$ in the $p_i(d)$ direction by taking the dot product of 4.7 with $(1/\sqrt{2}, -1/\sqrt{2})$, obtaining

$$\tau_i(d)' = \sqrt{2} \left( \sum_{j \ne i} (T_{ij}/\delta_i C_i)\, p_j(d) - (1/R_i C_i)\, p_i(d) + (-\alpha_i/R_i + I_i)/\delta_i C_i \right)$$
Using the simple geometrical relation $(d/dt)\, p_i(d) = \tau_i(d)'/\sqrt{2}$, we see that each $(d/dt)\, p_i(d)$ satisfies precisely the same equation as $dV_i/dt$ in 4.4. Thus we have an instance of a polymatrix game of the form 3.8 which is equivalent to 4.4.
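The reduction 4.6 is mechanical, and worth scripting as a sanity check. A sketch (ours; parameter values are arbitrary) that builds the game data from network parameters and confirms that the projected field of 3.8 reproduces $dV_i/dt$ of 4.4:

```python
import numpy as np

def hopfield_to_polymatrix(T, C, R, alpha, delta, I):
    """Sketch of the reduction 4.6: amplifier i becomes a player with
    strategies (d, h); only the (d, d) payoff entries and c_i(d) are
    nonzero, and all carry the factor 2 of 4.6."""
    n = len(C)
    Rg = np.zeros((2 * n, 2 * n))
    cg = np.zeros(2 * n)
    for i in range(n):
        for j in range(n):
            if j != i:
                Rg[2 * i, 2 * j] = 2 * T[i, j] / (delta[i] * C[i])
        Rg[2 * i, 2 * i] = -2 / (R[i] * C[i])
        cg[2 * i] = 2 * (-alpha[i] / R[i] + I[i]) / (delta[i] * C[i])
    return Rg, cg

rng = np.random.default_rng(1)
n = 3
T = rng.normal(size=(n, n)); np.fill_diagonal(T, 0.0)
C = np.ones(n); R = 0.5 * np.ones(n)
alpha = -np.ones(n); delta = 2.0 * np.ones(n); I = rng.normal(size=n)

V = rng.uniform(0, 1, size=n)
p = np.empty(2 * n); p[0::2] = V; p[1::2] = 1.0 - V   # p_i(d) = V_i

Rg, cg = hopfield_to_polymatrix(T, C, R, alpha, delta, I)
field = Rg @ p + cg
proj = (field[0::2] - field[1::2]) / 2.0   # projected onto p_i(d)+p_i(h)=1
dV = (T @ V) / (delta * C) - V / (R * C) + (-alpha / R + I) / (delta * C)
assert np.allclose(proj, dV)               # the dynamics of 3.8 match 4.4
```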
5 Extension to Piecewise-Linear Amplifiers

In this section we show that the bounded linear amplifiers described previously can, within the same model, be used to construct, to an arbitrary degree of accuracy, any piecewise-linear amplifier with bounded input. To consider the simplest case, suppose we want a voltage amplifier $g_i$ (Fig. 2) whose input $u_i$ is bounded between $\alpha_i$ and $\beta_i$, and which linearly maps the interval $[\hat\alpha_i, \hat\beta_i]$ onto $[0, 1]$, where $\alpha_i < \hat\alpha_i < \hat\beta_i < \beta_i$. We can construct $g_i$ with two bounded linear amplifiers (5.1)
using the circuit in Figure 3. With respect to Figure 3, let $\rho_i$ and $C_i$ have the desired values for the input to amplifier $g_i$, and let $T_{i^*,i} = \rho_{i^*} = 1$. Observe $\rho_i$ and $C_i$ act as a low-pass frequency filter for amplifier i. It follows that if we choose $C_{i^*}$ sufficiently small with respect to $C_i$, we may neglect its impedance, treating the output of i as going instantaneously
through a pure voltage divider. In that case

$$V_i(t)\, T_{i^*,i} - u_{i^*}(t) \left( 1/\rho_{i^*} + T_{i^*,i} \right) = 0 \eqno(5.2)$$

and hence

$$u_{i^*}(t) = V_i(t)/2 \eqno(5.3)$$

Figure 2: Piecewise-linear amplifier $g_i$ with one nonzero slope.
It follows from 5.1 and 5.3 that in response to any input voltage to amplifier $g_i$, $g_{i^*}$ and hence $g_i$ will have (as $C_{i^*}$ goes to zero) the output given in Figure 2. This procedure may be extended to more complicated piecewise-linear amplifiers $g_i$ such as in Figure 4, where we have two distinct nonzero slopes. Here the lower and upper bounds are $\alpha_i$ and $\beta_i$, the first nonlinearity occurs at $\alpha_{i(1)}$, the second at $\beta_{i(1)}$ (which we also label $\alpha_{i(2)}$), and the third at $\beta_{i(2)}$. To create $g_i$, we may use the circuit in Figure 5. As before, $\rho_i$ and $C_i$ have the desired values associated with the input to $g_i$. Other values are
$$\rho_{i(1)} = T_{i(1),i} = 1 \qquad \text{and} \qquad \rho_{i(2)} = T_{i(2),i} = 1 \eqno(5.4)$$

where $\mu = g_i(\beta_{i(1)}) = g_i(\alpha_{i(2)})$ will be used as a weighting factor between the two nonzero slopes. Further let (5.5)

Figure 3: Circuit giving piecewise-linear response of Figure 2. Note amplifiers $g_i$ and $g_{i^*}$ are same type as in Figure 1.
If all other capacitors are small relative to $C_i$, then as before we have in the limit

$$u_{i(1)}(t) = V_i(t)/2$$

and in addition

$$u_{i(2)}(t) = V_{i(1)}(t)/2 \eqno(5.6)$$
Figure 4: A piecewise-linear amplifier $g_i$ with two nonzero slopes.

and hence (5.7). It follows from 5.4-5.7 that $g_i$ will have the desired form. Notice that 5.5 uniquely determines the abscissa values of the nonlinear points of Figure 4, and that 5.4 uniquely determines the ordinate value for the transition from the first nonzero slope to the second. The reader may verify that in general this procedure can be extended to n nonzero slopes using n + 2 bounded linear amplifiers. Although there is a complexity cost in this procedure (see the next section and Appendix B), the low precision of individual neurons as processing units implies that piecewise-linear approximations with just a few linear segments are often likely to suffice, in which case the extra computational cost would be minimal. For instance in Figure 6 we compare a piecewise-linear response curve with a smooth asymptotic sigmoid $g(u) = \gamma/[1 + \exp(-u/\lambda)]$. If we assume (cf. Hopfield 1984) that the smooth sigmoid represents the mean firing rate of a spiking neuron for a 50 msec interval, that the neuron's upper firing rate is 200/sec, and that the number of firings in a time interval has a Poisson distribution,
Figure 5: Circuit giving piecewise-linear response of Figure 4. In general n nonzero slopes require n + 2 bounded linear amplifiers.

then the upper and lower curves give the corresponding values of the asymptotic sigmoid mean plus and minus a standard deviation. It seems hard to imagine, within the kind of time periods in which neural systems compute (e.g., a few hundred msec), that one could distinguish between the dynamical behavior of a system composed of one or the other kind of low-precision amplifier.
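The one-slope construction of 5.1-5.3 is easy to check numerically. The following is a sketch under our own illustrative choice of bounds (the circuit values $T_{i^*,i} = \rho_{i^*} = 1$ give the halving divider of 5.3):

```python
import numpy as np

def bounded_linear(u, lo, hi):
    """A bounded linear amplifier: maps [lo, hi] linearly onto [0, 1],
    saturating outside the interval (the amplifiers of Section 4)."""
    return np.clip((u - lo) / (hi - lo), 0.0, 1.0)

# Illustrative bounds: overall input range [-1, 1], desired linear
# region [-0.25, 0.5] (the hatted breakpoints of Section 5).
alpha, beta = -1.0, 1.0
a_hat, b_hat = -0.25, 0.5

def g(u):
    # First stage: bounded linear amplifier on [alpha, beta].
    V = bounded_linear(u, alpha, beta)
    # Voltage divider with T = rho = 1 halves the output (5.2-5.3).
    u_star = V / 2.0
    # Second stage: bounded linear amplifier whose active interval is the
    # image of [a_hat, b_hat] under the first two stages.
    lo = bounded_linear(a_hat, alpha, beta) / 2.0
    hi = bounded_linear(b_hat, alpha, beta) / 2.0
    return bounded_linear(u_star, lo, hi)

u = np.linspace(-1.5, 1.5, 7)
print(np.round(g(u), 3))   # 0 below a_hat, 1 above b_hat, linear between
```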
6 Complexity of Computing Equilibria

In the previous sections we have reduced the problem of finding an equilibrium for a general analog network with bounded piecewise-linear amplifiers to that of finding an equilibrium of a polymatrix game. We therefore now address the question of how we can find such an equilibrium, and do so in a computationally efficient manner. Certainly the question is far from trivial since, as is shown in Appendix A, a primal approach such as a generalized gradient descent technique that simply follows integral curves, while potentially useful, can fail badly even in very simple nonsymmetric cases.
In this section we describe an alternate view of this problem, a dual approach, to borrow from mathematical programming terminology, which will provide us with an algorithm for computing an equilibrium for any problem instance, in a time complexity which, while not deterministically polynomial, appears to be polynomial at least in a very strong probabilistic sense.

Figure 6: A biological frame of reference for comparing a smooth and a piecewise-linear response curve. We assume (cf. Hopfield 1984) that the sigmoid represents the mean firing rate of a spiking neuron for a 50 msec interval, that the upper firing rate is 200/sec, and that the number of firings in a time interval has a Poisson distribution. The upper and lower curves give the corresponding values of the asymptotic sigmoid mean plus and minus a standard deviation. Thus the piecewise linear curve is well within the likely statistical behavior of a neuron with the given asymptotic sigmoid mean firing rate.
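The ±1 standard deviation band of Figure 6 is elementary to reproduce: at 200/sec over a 50 msec window the mean count peaks at 10 spikes, and for a Poisson count the variance equals the mean. A sketch (ours; the gain $\gamma = 10$ and slope $\lambda$ are illustrative values):

```python
import numpy as np

# Illustrative sigmoid: mean spike count in a 50 msec window, peaking at
# 200/sec * 0.05 sec = 10 spikes (gamma and lam are assumed values).
gamma, lam = 10.0, 0.2
u = np.linspace(-1.0, 1.0, 9)
mean = gamma / (1.0 + np.exp(-u / lam))

std = np.sqrt(mean)            # Poisson: variance equals the mean
upper, lower = mean + std, np.maximum(mean - std, 0.0)
for row in zip(u, lower, mean, upper):
    print("u=%+.2f  mean-sd=%5.2f  mean=%5.2f  mean+sd=%5.2f" % row)
```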
Let us first redefine R and c by adding a sufficiently small negative constant k to each term so that

$$R < 0, \qquad c < 0 \eqno(6.1)$$

Notice this does not alter the Nash equilibria, since each player receives an identical penalty $(n + 1)k$ regardless of strategy. On the other hand,
6.1 implies each player's payoff gradient 3.5 is negative. This permits us to relax 3.4 and replace it with

$$Ap \le q, \qquad p \ge 0 \eqno(6.2)$$

(To see this, observe that for some i, letting $A_{i\cdot}$ be the ith row of A, if $A_{i\cdot} p' < q_i$, then for some $\epsilon > 0$, $p_i' = (1 - \epsilon) p_i$ is a feasible preferred strategy for i given the other players' strategies remain unchanged. Thus all equilibria must still satisfy 3.4.) We now describe our principal analytic tool, a variant of the well-known theorem of Kuhn and Tucker (1951). The idea is to replace the requirement of the nonexistence of d in 3.7 with the requirement of the existence of a pair of vectors of dual variables or Kuhn-Tucker multipliers y, u. The n multipliers y correspond to the n constraints $Ap \le q$, and the mn multipliers u correspond to the mn constraints $p \ge 0$. Now it can be shown (e.g., Miller and Zucker 1991) that finding p and v satisfying our original equilibrium conditions 3.7 is equivalent to finding p, y, u, v that satisfy the system

$$\begin{aligned} &A^T y - I_{mn}\, u - Rp = c, \qquad Ap + I_n\, v = q \\ &p,\ y,\ u,\ v \ge 0 \\ &u^T p = 0, \qquad y^T v = 0 \end{aligned} \eqno(6.3)$$

where v is an n-vector, $I_{mn}$ and $I_n$ are identity matrices of size mn and n, and R, p, c, q are as in Section 3. We refer to v as a vector of slack variables for the constraints 6.2. This is because 6.3 implies
$$p,\ v \ge 0 \eqno(6.4)$$

which is of course equivalent to 6.2. Similarly p may be viewed as a set of slack variables for the mn multipliers u. The third line of 6.3 may then be interpreted as stating that if a slack variable is positive, its Kuhn-Tucker multiplier is zero. The first two lines of 6.3 represent a set of linear equalities and linear inequalities of the kind found in the well-known linear programming problem (e.g., Dantzig 1963). Although solving such problems is far from trivial, they are known to be of polynomial computational complexity in terms of the problem specification size, as opposed to NP-complete problems, which are for all practical purposes of exponential complexity (Garey and Johnson 1979). The situation changes dramatically, however, when we consider the third line of 6.3. The problem then becomes an instance of the linear complementarity problem, which in general is NP-complete (Garey and Johnson
1979). Furthermore, although the theory of NP-completeness applies to digital computation (specifically, Turing machines) the intuitively attractive thesis has been proposed that any analog machine (such as Hopfield and Tank 1985) for solving NP-complete problems would necessarily consume exorbitantly large physical resources (Vergis et al. 1986). (Note Hopfield and Tank do not actually claim to solve the NP-complete traveling salesman problem with an analog device, but merely to find approximate solutions.) Thus it seems very difficult to accept the idea that any biologically plausible system could be based on solving an NP-complete problem. In particular, if finding an equilibrium for the kinds of polymatrix games we are interested in is an NP-complete problem, it is hard to imagine such a model could be biologically useful. (See also Kirousis and Papadimitriou 1988 and Tsotsos 1988.) There is a possible way out of this impasse, in that 6.3 has a special structure, and linear complementarity problems with special structure may still be solvable in polynomial time. For instance if R is negative semidefinite, then 6.3 may be solved in polynomial time by an algorithm similar to the ellipsoid algorithm for linear programming (Adler et al. 1980). This class of Rs has two important subclasses. The first consists of those R that are also symmetric, in which case solving 6.3 amounts to solving a convex quadratic program. The second subclass consists of those R for which the diagonal submatrices $[r_{ii}(\lambda_i, \tilde\lambda_i)]$ are zero for each i, and for which $R = -R^T$. Such an R is trivially negative semidefinite, and defines a zero sum polymatrix game. It follows that equilibria for such games may be computed in polynomial time. To the authors' knowledge, the complexity of computing general polymatrix game equilibria, that is, linear complementarity problems of the form 6.3, is an open question. There are, however, results that at least make it appear unlikely this problem is NP-complete. In particular, 6.3 may be solved by an algorithm belonging to a family of vertex pivoting algorithms (the most well-known member being the simplex method for linear programming) that tend to be extremely fast in practice but that can, in certain artificial cases, require exponential time (Murty 1988, p. 162; Cottle 1980). This algorithm, known as Lemke's algorithm, is described in detail in Appendix B. It was not originally used for polymatrix games, and the fact that it could be was first recognized by Eaves (1973), although in a different form than 6.3 (see also Appendix B for a discussion of Eaves' method). Independently of Eaves' result, we showed Lemke's algorithm could also be applied in a form known as copositive-plus, of which 6.3 together with the assumption 6.1 is an example (Miller and Zucker 1991). Given that we may use Lemke's algorithm for polymatrix game equilibria, if we extrapolate from linear programming, for which true polynomial algorithms were found after decades of experience with the simplex
method, the latter invariably polynomial in practice, then it seems reasonable to hope that true polynomial algorithms exist for polymatrix game equilibria as well. Moreover, from the point of view of the present paper, what is significant is that in practice 6.3 is solvable in polynomial time, just as with NP-complete problems it is only known that in practice they are not solvable in polynomial time. [To know the latter with certainty would mean solving the famous "P = NP?" problem (Garey and Johnson 1979).] Indeed the above probabilistic view of the efficiency of Lemke's algorithm has been given a rigorous form. Todd (1983) has shown, under an extremely broad class of joint probability distributions on the numerical components of the linear complementarity problem, that the expected number of pivots (see Appendix B) of a particular form of Lemke's algorithm is bounded above by N(N + 1)/2, where N is the number of linear equations, and hence the total expected computation is polynomial. While problems of the form 6.3 can only be regarded as a subpopulation of a population of problems for which this result holds, still Todd's analysis is highly encouraging. We conclude this section with two propositions. First, since we have reduced our modified Hopfield network (4.3) to a two-strategy polymatrix game (3.8) we may state

Proposition 1. The complexity of computing an equilibrium for an analog network with piecewise-linear amplifiers is bounded by the complexity of computing an equilibrium for a polymatrix game.

Finally, since Lemke's algorithm solves the linear complementarity problem 6.3, which is equivalent to finding an equilibrium for 3.8, we conclude

Proposition 2. Lemke's algorithm computes an equilibrium for an analog network with piecewise-linear amplifiers, for arbitrary interconnections, amplifier gains, and input biases.
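Whatever solver one uses, candidate solutions of 6.3 can be verified mechanically, which is useful when testing an implementation on these games. A minimal sketch of such a verifier (ours; the tolerance is illustrative), following the three lines of 6.3:

```python
import numpy as np

def satisfies_6_3(R, c, A, q, p, y, u, v, tol=1e-8):
    """Check the three lines of the linear complementarity system 6.3.

    Line 1 is the pair of linear equalities, line 2 is nonnegativity,
    line 3 is complementarity (a positive slack forces its Kuhn-Tucker
    multiplier to zero)."""
    line1 = (np.allclose(A.T @ y - u - R @ p, c, atol=tol)
             and np.allclose(A @ p + v, q, atol=tol))
    line2 = all((w >= -tol).all() for w in (p, y, u, v))
    line3 = abs(u @ p) < tol and abs(y @ v) < tol
    return line1 and line2 and line3
```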
7 Appendix A: Computing Equilibria with Integral Curves
In this appendix we shall discuss a primal algorithm for computing polymatrix game equilibria, that is, an algorithm which attempts to solve 3.8 directly. The primal method we shall be concerned with is in a sense the most obvious method, and the only one plausible biologically. It amounts to following the integral curves of the dynamical system 3.8 defining the network (cf. the relaxation labeling algorithm of Hummel and Zucker 1983). Although Hopfield and Tank (1985) have built physical devices to imitate
such systems, this algorithm can of course be attempted numerically, for instance by a piecewise smooth version of Euler's method, each "piece" corresponding to a new face of the polytope constraint set. If R is symmetric this is similar to the projected gradient method of mathematical programming (Luenberger 1973, p. 247). Specifically, given a point $p^k$, the algorithm may be described as follows: Iteratively compute $p^{k+1}$ by letting $\tilde d^k + p^k$ be the projection of $Rp^k + c$ onto the constraint set $Ap = q$, and letting $d^k + p^k$ be the projection of $Rp^k + c$ onto $Ap = q$ together with all constraints $p_i = 0$ for which both $p_i^k = 0$ and $\tilde d_i^k < 0$. Then let $p^{k+1} = p^k + a_k d^k$ for some scalar $a_k > 0$. If $d^k = 0$, the algorithm terminates and $p^k$ is an equilibrium. There are many critical issues in implementing this method, such as the choice of $a_k$, the manner of computing the projection (trivial in the two-state case), and the choice of a stopping rule, since in general $d^k = 0$ will be a numerical impossibility, unless, as is common in optimization procedures, we use the proximity of the $p^k$ to a particular vertex to correctly guess that that vertex is an equilibrium.
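A sketch of this iteration (ours; the step size $a_k$ is held constant for simplicity), written for blocks of strategies so the two-stage projection just recenters the free components:

```python
import numpy as np

def primal_step(R, c, p, blocks, a=0.01):
    """One projected Euler step of the primal method for 3.8: project the
    field Rp + c onto Ap = q, then re-project with the active constraints
    p_i = 0 (those at zero whose first projection is negative) frozen."""
    g = R @ p + c
    d = np.zeros_like(p)
    for idx in blocks:
        d_tilde = g[idx] - g[idx].mean()               # onto A_i. d = 0
        free = ~((p[idx] <= 1e-12) & (d_tilde < 0.0))  # freeze active face
        if free.any():
            sub = np.zeros(len(idx))
            sub[free] = g[idx][free] - g[idx][free].mean()
            d[idx] = sub
    return p + a * d, d          # d == 0 signals an equilibrium

# Iterating primal_step on the zero-sum example 7.1 below exhibits the
# nonconvergence (closed orbits) discussed in the text.
```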
As Luenberger (1973, p. 251) notes, a major source of potential difficulty with the primal algorithm is the discontinuous behavior of the vector field on the polytope boundary, although he also states this is not a problem in practice for the projected gradient method. However, from a theoretical standpoint the absence of a global Lipschitz condition

$$\| \pi(p_1) - \pi(p_2) \| \le \lambda\, \| p_1 - p_2 \|$$

where $\pi(p)$ is the projected vector field at p and $\lambda \ge 0$ a constant, makes it difficult to say much about the complexity of the primal method, except for a given smooth segment belonging to the interior of a given polytope face. For such a segment $\phi(t)$, $t_0 \le t \le t_1$ (where the Lipschitz condition does hold) we can show (cf. Vergis et al. 1986, p. 108) that an approximation to the actual curve can be computed to any accuracy $\epsilon > 0$ with a number of steps which is polynomial in $1/\epsilon$. However, at least with Euler's method, the number of steps required for a given accuracy $\epsilon$ may be exponential in $(t_1 - t_0)$. If R is not symmetric there is perhaps a deeper problem than discontinuity and numerical approximation, namely the question of convergence of the integral curve itself. Consider for example the zero sum game given by

$$r_{12}(d, d) = r_{12}(h, h) = 1, \qquad r_{21}(d, d) = r_{21}(h, h) = -1 \eqno(7.1)$$

with all other entries of R and c zero.
We can compute the magnitude of the projection of the vector

$$\begin{bmatrix} \pi_1(d)' \\ \pi_1(h)' \end{bmatrix} = \begin{bmatrix} p_2(d) \\ 1 - p_2(d) \end{bmatrix} \eqno(7.2)$$

onto the subspace $p_1(d) + p_1(h) = 1$ by taking the dot product of 7.2 with $(1/\sqrt{2}, -1/\sqrt{2})$, giving $\sqrt{2}\, p_2(d) - 1/\sqrt{2}$. Dividing by $\sqrt{2}$ gives the projected derivative of $p_1(d)$ as a function of $p_2(d)$, namely

$$\frac{d}{dt}\, p_1(d) = p_2(d) - \frac{1}{2}$$

Similarly we have

$$\frac{d}{dt}\, p_2(d) = -\left( p_1(d) - \frac{1}{2} \right)$$

Substituting, we get the second order linear equation

$$\frac{d^2}{dt^2}\, p_2(d) = -\left( p_2(d) - \frac{1}{2} \right)$$

whose solutions are of the form

$$\alpha \sin(t) + \beta \cos(t) + \frac{1}{2}$$

Similarly for solutions of $p_1(d)$. Thus 7.1 has just one convergent solution, namely the constant curve corresponding to the unique equilibrium $p_1(d) = p_2(d) = 1/2$. All other curves not touching the polytope boundary will follow a closed elliptical contour and never converge. Thus the primal algorithm need not find an equilibrium for 3.8 in its general form. As with the discontinuity and numerical questions, however, we interpret this as implying that one should choose biologically plausible instances of 3.8, rather than reject the primal method. We conclude this section by noting that if we change the $-1$s to 1s in 7.1, we have the solution

$$\alpha \exp(t) + \beta \exp(-t) + \frac{1}{2} \eqno(7.3)$$
which gives rise to an unstable saddle equilibrium at $p_1(d) = p_2(d) = 1/2$. Thus from a purely analytic viewpoint we also have the possibility ($\alpha = 0$) of the solution curve reaching an unstable equilibrium. However, in practice the first term in 7.3 eventually will dominate, and the system will reach one of the two stable equilibria $[p_1(d), p_2(d)] = (1, 1)$ or $[p_1(d), p_2(d)] = (0, 0)$.
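A short simulation (our sketch; step size and duration are arbitrary) of the interior dynamics of 7.1 makes the nonconvergence concrete: the distance from the equilibrium (1/2, 1/2) never decays.

```python
import numpy as np

# Interior dynamics of the zero-sum example 7.1:
#   p1' = p2 - 1/2,   p2' = -(p1 - 1/2)
p1, p2, dt = 0.8, 0.5, 1e-3
for step in range(200_000):
    p1, p2 = p1 + dt * (p2 - 0.5), p2 - dt * (p1 - 0.5)   # Euler step

radius = np.hypot(p1 - 0.5, p2 - 0.5)
print("distance from equilibrium at t = 200:", round(radius, 3))
# The trajectory circles the equilibrium; the radius never shrinks to 0.
```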
We now describe a dual algorithm for 3.8, that is, an algorithm that directly solves the equivalent dual problem 6.3. This procedure, known as Lemke's algorithm, is one of a family of vertex pivoting algorithms
that has been developed for the linear complementarity problem, of which 6.3 is an example. (See Murty 1988 for a comprehensive survey. Also Cottle and Dantzig 1968.) The central algebraic operation of all these algorithms is a pivot or basis change. The operation acts on a system of linear equations

Ax = b   (8.1)
where A is an M x N matrix (M < N), and presupposes an M x M identity matrix I_M distributed among the columns of A (the basis columns). A pivot is then a choice of a nonbasic column r and a basic column s, and a corresponding multiplication of A, b by an M x M matrix B such that BA contains an identity matrix in column r and the M - 1 columns of the former basis excluding s. The significance of a basis is that a solution to Ax = b is trivial, since if I_M is, say, the first M columns of A, then we need just let x_i = b_i for i = 1, ..., M, and x_i = 0 for i = M+1, ..., N. A key feature of pivoting is that once a column is chosen to enter the basis, the requirement that b and Bb be nonnegative will (under certain mild nondegeneracy conditions) uniquely determine which column will leave. With Lemke's algorithm we associate 8.1 with the first line of 6.3 and then add to A a temporary "artificial column" z of negative numbers (say -1s) corresponding to a new variable z_0. A special initial pivot brings z_0 into the basis and causes the right-hand side of 8.1 to become nonnegative. All subsequent pivots will maintain this nonnegativity, thus satisfying the second line of 6.3 and also determining which column can leave the basis. Satisfying the third line of 6.3 (the "complementarity" condition) will determine which column can enter, and thus a unique pivoting sequence is specified. The algorithm terminates in a solution when the first line of 6.3 is satisfied as well, which coincides with z_0 leaving the basis. In general Lemke's algorithm may not terminate in a solution. This occurs when the new pivoting column is nonpositive (geometrically an infinite ray), thus making it impossible to pivot into that column and preserve the nonnegativity constraints. A critical issue in Lemke's algorithm is therefore specifying sufficient conditions on the structure of the linear complementarity problem being solved to ensure that

1. either a termination in a solution occurs, or
2. termination in a ray implies there exists no solution.

[In the case of polymatrix game equilibria a solution always exists, so the above conditions (1) and (2) imply that a solution is computed.] In Miller and Zucker (1991) we have observed that polymatrix games may be put in such a form, a special case of a class defined by Lemke (1965) and later known as copositive-plus (Cottle and Dantzig 1968). With respect to 6.3, the essential condition for this result is that we may assume R < 0.
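To make the pivot concrete, here is a minimal sketch (ours, assuming dense numpy arrays) of a single basis change on the system 8.1:

    import numpy as np

    def pivot(A, b, row, col):
        # One pivot: scale 'row' so that A[row, col] = 1, then eliminate
        # column 'col' from every other row. Acting on A and b together
        # amounts to the multiplication by the matrix B described above.
        Ab = np.hstack([A, b.reshape(-1, 1)]).astype(float)
        Ab[row] /= Ab[row, col]
        for i in range(Ab.shape[0]):
            if i != row:
                Ab[i] -= Ab[i, col] * Ab[row]
        return Ab[:, :-1], Ab[:, -1]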
A result by Eaves (1973) shows that polymatrix games fit into another class of linear complementarity problem of the same general form as 6.3, for which Lemke's algorithm is also guaranteed to terminate successfully. In this case there is no requirement that R < 0. However it is necessary that Ap <= q include the special constraint

e^T p <= kappa

where e is a vector of 1s, and kappa is a variable that is treated during each pivoting operation as though it were arbitrarily large in relation to the absolute values of any numbers used to specify the problem. As with the simplex method for linear programming, Lemke's algorithm may take an exponential number of pivots to solve certain linear complementarity problems (e.g., Cottle 1980). However, as far as we are aware, all cases where exponential behavior has been demonstrated were specifically created for that purpose. When applied to real world or simulated random problems, the typical number of pivots needed for Lemke's algorithm to terminate is O(M) (Murty 1988, p. 162), or in the present case O(mn). Since each pivot may be accomplished in O(M^2) arithmetic operations, this implies an empirical bound of O(m^3 n^3) arithmetic operations. If, as in the present case, m is fixed, then we get O(n^3).
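For concreteness, the following is a minimal, textbook-style rendering of Lemke's algorithm for the standard linear complementarity problem w = Mz + q, w, z >= 0, w^T z = 0, with M and q given as numpy arrays. It is a sketch, not the procedure used for 6.3: degeneracy resolution and the special structure above (including the kappa constraint) are omitted.

    import numpy as np

    def lemke(M, q, max_pivots=200):
        # Tableau for w - M z - e z0 = q; columns are [w | z | z0 | rhs].
        n = len(q)
        if np.all(q >= 0):
            return np.zeros(n)              # z = 0 already solves the LCP
        T = np.hstack([np.eye(n), -M, -np.ones((n, 1)), q.reshape(n, 1)])
        basis = list(range(n))              # the w variables start basic
        row = int(np.argmin(T[:, -1]))      # special initial pivot:
        entering = 2 * n                    # z0 enters; most negative row leaves
        for _ in range(max_pivots):
            T[row] /= T[row, entering]
            for i in range(n):
                if i != row:
                    T[i] -= T[i, entering] * T[row]
            leaving, basis[row] = basis[row], entering
            if leaving == 2 * n:            # z0 left the basis: solution found
                z = np.zeros(n)
                for i, b in enumerate(basis):
                    if n <= b < 2 * n:
                        z[b - n] = T[i, -1]
                return z
            # Complementarity: the complement of the leaving variable enters.
            entering = leaving + n if leaving < n else leaving - n
            col = T[:, entering]
            if np.all(col <= 1e-12):
                return None                 # ray termination
            ratio = np.where(col > 1e-12, T[:, -1] / col, np.inf)
            row = int(np.argmin(ratio))     # keep the right-hand side nonnegative
        return None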
Acknowledgments

The authors thank Frank Ferrie and David Jones for valuable criticism and suggestions. This research was supported by grants from NSERC and AFOSR. S. W. Z. is a Fellow, Canadian Institute for Advanced Research.
References

Adler, I., McLean, R. P., and Provan, J. S. 1980. An application of the Khachiyan-Shor algorithm to a class of linear complementarity problems. Cowles Foundation Discussion Paper 549, Yale University, New Haven, Connecticut.
Adler, I., Megiddo, N., and Todd, M. J. 1984. New results on the average behavior of simplex algorithms. Bull. Am. Math. Soc. (N.S.) 11, 378-382.
Cottle, R. W. 1980. Observations on a class of nasty linear complementarity problems. Discrete Appl. Math. 2, 89-111.
Cottle, R. W., and Dantzig, G. B. 1968. Complementary pivot theory of mathematical programming. In Mathematics of the Decision Sciences, G. B. Dantzig and A. F. Veinott, Jr., eds., Part I, pp. 115-136. AMS.
Dantzig, G. B. 1963. Linear Programming and Extensions. Princeton University Press, Princeton, NJ.
Eaves, B. C. 1973. Polymatrix games with joint constraints. SIAM J. Appl. Math. 24, 418-423.
Garey, M. R., and Johnson, D. S. 1979. Computers and Intractability. W. H. Freeman, San Francisco.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds., Vol. I, pp. 282-317. MIT Press, Cambridge, MA.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Hopfield, J. J., and Tank, D. W. 1985. 'Neural' computation of decisions in optimization problems. Biol. Cybernet. 52, 1-12.
Hummel, R. A., and Zucker, S. W. 1983. On the foundations of relaxation labeling processes. IEEE Trans. Pattern Anal. Machine Intell. 5, 267-287.
Johnson, D. S., Papadimitriou, C. H., and Yannakakis, M. 1988. How easy is local search? J. Comput. Syst. Sci. 37, 79-100.
Kinderlehrer, D., and Stampacchia, G. 1980. An Introduction to Variational Inequalities and Their Applications. Academic Press, New York.
Kirousis, L. M., and Papadimitriou, C. H. 1988. The complexity of recognizing polyhedral scenes. J. Comput. Syst. Sci. 37, 14-38.
Kuhn, H. W., and Tucker, A. W. 1951. Nonlinear programming. In Second Berkeley Symposium on Mathematical Statistics and Probability, J. Neyman, ed., pp. 481-492. University of California Press, Berkeley, CA.
Lemke, C. E. 1965. Bimatrix equilibrium points and mathematical programming. Management Sci. 11, 681-689.
Luenberger, D. G. 1973. Introduction to Linear and Nonlinear Programming. Addison-Wesley, Reading, MA.
Miller, D. A., and Zucker, S. W. 1991. Copositive-plus Lemke algorithm solves polymatrix games. Operations Res. Lett. 10, 285-290.
Murty, K. G. 1988. Linear Complementarity, Linear and Nonlinear Programming. Heldermann Verlag, Berlin.
Nash, J. F. 1951. Noncooperative games. Ann. Math. 54, 286-295.
Papadimitriou, C. H., Schaffer, A. A., and Yannakakis, M. 1990. On the complexity of local search. Proceedings of the 22nd Annual ACM Symposium on the Theory of Computing, Baltimore, Maryland, May, 438-445.
Pour-El, M. B., and Richards, I. 1981. The wave equation with computable initial data such that its unique solution is not computable. Adv. Math. 39, 215-239.
Sejnowski, T. J. 1981. Skeleton filters in the brain. In Parallel Models of Associative Memory, G. E. Hinton and J. A. Anderson, eds., pp. 189-212. Lawrence Erlbaum, Hillsdale, NJ.
Todd, M. J. 1983. Polynomial expected behavior of a pivoting algorithm for linear complementarity and linear programming problems. Tech. Rep. 595, School of Operations Research and Industrial Engineering, Cornell University, Ithaca, New York.
Tsotsos, J. 1988. A 'complexity-level' analysis of intermediate vision. Int. J. Comput. Vision 1, 303-320.
Vergis, A., Steiglitz, K., and Dickinson, B. 1986. The complexity of analog computation. Math. Comput. Simulation 28, 91-113.
von Neumann, J., and Morgenstern, O. 1944. Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ.
Wilson, G. V., and Pawley, G. S. 1988. On the stability of the travelling salesman problem algorithm of Hopfield and Tank. Biol. Cybernet. 58, 63-70.
Zucker, S. W., Dobbins, A., and Iverson, L. 1989. Two stages of curve detection suggest two styles of visual computation. Neural Comp. 1, 68-81.
Received 23 October 1990; accepted 10 May 1991.
NOTE
Communicated by Scott Kirkpatrick
A Volatility Measure for Annealing in Feedback Neural Networks

Joshua Alspector
Torsten Zeppenfeld(1)
Stephan Luna(2)

Bellcore, Morristown, NJ 07962-1910 USA
In feedback neural networks, especially for static pattern learning, a reliable method of settling is required. Simulated annealing has been used, but it is often difficult to determine how to set the annealing schedule. Often the specific heat is used as a measure of when to slow down the annealing process, but this is difficult to measure. We propose another measure, volatility, which is easy to measure and related to the Edwards-Anderson model in spin-glass physics. This paper presents the concept of volatility, an argument for its similarity to specific heat, simulations of dynamics in Boltzmann and mean-field networks, and a method of using it to speed up learning.

1 Volatility and Specific Heat
The Boltzmann machine (Ackley et al. 1985) can be described by an energy function

E = -(1/2) Sum_i Sum_{j != i} w_ij s_i s_j   (1.1)
appropriate for the Hopfield model (Hopfield 1982). The s_i represent binary state neurons. The process of simulated annealing (Kirkpatrick et al. 1983) helps the network arrive at a low energy configuration. In the mean field approximation (Peterson and Anderson 1987), instead of binary neurons, the activation is

v_i = tanh(beta u_i)   (1.2)
(1) Permanent address: School of Computer Science, Carnegie Mellon University, Schenley Park, Pittsburgh, PA 15213.
(2) Permanent address: Department of EECS, University of California, Berkeley, CA 94720.
and we get an expression for the average energy

<E> = -Sum_{i<j} w_ij v_i v_j = -(1/2) Sum_i u_i v_i   (1.3)
where we have substituted the net input u_i = Sum_j w_ij v_j. We would like to know when the system has settled. Intuitively, at high temperature, binary neurons will be volatile and their states will be equally likely to be +1 as -1. The average state would be about 0. As the temperature is lowered, the states lock into either +1 or -1 and stay there. If one takes the absolute value of neural states or their square, the average would be 1. A quantity that describes this concisely is what we call the volatility

q = (1/N) Sum_i <s_i>^2   (1.4)

where the sum is taken over the unclamped neurons, which number N. This quantity is 0 at high temperature and 1 at low temperature. It is similar to the Edwards-Anderson order parameter (Edwards and Anderson 1975) from spin-glass physics although the motivation is different. This is also a valid intuitive measure in the mean field approximation

q = (1/N) Sum_i v_i^2   (1.5)

and has been used to judge phase transitions in optimization problems (Peterson and Soderberg 1989). In the process of going from high T to low T, the system goes from disorder to order, undergoing one or more phase transitions. It is at the phase transition points that annealing must be slowed so that the system can more carefully search for a low energy configuration. A method for determining these transition points is to use the specific heat

C = d<E>/dT
which has peaks at these points due to fluctuations in the energy (see equation 2.1). At high temperature in the mean-field approximation (beta -> 0), equation 1.2 becomes

v_i = beta u_i   (1.7)

and the average energy (equation 1.3) is

<E> = -(beta/2) Sum_i u_i^2 = -(1/2) N T q   (1.8)

This shows that the volatility, q, rises at the same point as the average energy <E> as the temperature decreases. Therefore the first peak in the specific heat can be expected to be at the same place as the first rise in the volatility. The specific heat per neuron is closely related to the volatility at high temperature, as can be seen from the expression

C/N = (1/N) d<E>/dT = -(1/2) d(Tq)/dT
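In simulations the volatility is cheap to compute from the sampled states alone. A minimal sketch, assuming the reconstructions of equations 1.4 and 1.5 above and numpy arrays of sampled states:

    import numpy as np

    def volatility(states):
        # Equation 1.4: q = (1/N) sum_i <s_i>^2, with 'states' a (T, N)
        # array of the +/-1 states of the N unclamped neurons over T cycles.
        return float(np.mean(states.mean(axis=0) ** 2))

    def volatility_mf(v):
        # Mean-field version (equation 1.5): q = (1/N) sum_i v_i^2.
        return float(np.mean(np.asarray(v) ** 2))

Note that, unlike the specific heat of equation 2.1 below, nothing here depends on the weights.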
2 Simulations
We have shown, by simulation in previous studies (Alspector et al. 1990), that Boltzmann and mean-field networks can have powerful learning and representation properties just like the more thoroughly studied backpropagation methods. In these studies, we used as benchmarks the parity and replication (identity) problems. Figure 1 shows volatility and specific heat plots for a 5 input, 5 hidden, 5 output (5-5-5) replication problem for Boltzmann learning. We used equation 1.4 to calculate the volatility at three different points during learning. We used an adaptive annealing schedule whereby the volatility was calculated after each temperature step. If it changed by more than 0.2, we went back to the previous step and cut the step size in half. Note the three transition temperatures for beginning, middle, and end of learning. The transition temperatures increase as the weights increase during learning. To check on the reasonableness of these temperatures, we also calculated the specific heat at each temperature using the formula

C = (<E^2> - <E>^2) / T^2   (2.1)

where we calculated E using equation 1.1 and did a Monte Carlo average over 50 update cycles after equilibration. This formula is valid for the Boltzmann-Gibbs distribution at equilibrium. Even though the data are noisy due to these averages, we can see that the volatility transitions occur at about the same place as the peaks in specific heat. However, even for Monte Carlo averages, volatility is much easier to measure than specific heat and does not depend explicitly on the weights. The volatility curves for mean-field gain annealing are much smoother than for stochastic annealing. The ease of measuring volatility makes it feasible to use as an adaptive means of setting the annealing schedule, as we did for the measurements in Figure 1, so that more time can be spent at the transition points. Because of the smoothness of the volatility curves in mean-field, it is possible to avoid annealing altogether and measure the correlations at a particular value of the volatility. We set the mean-field gain so that the network would be at a particular average volatility and learned at that temperature with no annealing. We only annealed occasionally (without learning) for a few patterns to see if the temperature for a particular volatility target changed as the weights changed during learning. It seems that q = 0.2-0.5 gives the best results as a volatility target for
5-5-5 replication. This is clearly problem dependent, since a value of 0.95 is best for 5-10-1 parity. Our single-temperature mean-field learning was 5 times faster than annealed mean-field and 30 times faster than Boltzmann learning, but about 3 times slower than backpropagation. Learning quality was similar to our previous results (Alspector et al. 1991).
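The adaptive schedule described above is equally simple to express. The following sketch uses the volatility function from Section 1; 'equilibrate', a routine that settles the network at temperature T and returns sampled states, is a hypothetical stand-in for the network simulation, and the parameter values are illustrative, not the authors' code:

    def anneal(equilibrate, T_start, T_min, dT=0.5, q_jump=0.2, max_steps=1000):
        # After each temperature step, recompute the volatility; if it
        # changed by more than q_jump, return to the previous temperature
        # and halve the step size (the rule quoted in Section 2).
        T = T_start
        q_prev = volatility(equilibrate(T))
        for _ in range(max_steps):
            if T - dT <= T_min:
                break
            q = volatility(equilibrate(T - dT))
            if abs(q - q_prev) > q_jump:
                dT /= 2.0          # too fast near a transition: smaller step
            else:
                T -= dT            # accept the step
                q_prev = q
        return T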
3 Conclusion

We have established a theoretical basis for the volatility measure, q, to substitute for the specific heat in annealing. Simulations have verified the validity of this measure and shown how to use it to speed up annealing and learning. This quantity is far easier to measure than specific heat because only knowledge of the neural states, and not the weights, is needed.
Acknowledgments

The authors are grateful for valuable discussions about order parameters and spin-glass physics with R. Meir and J. Schotland. We are also grateful to C. Peterson for pointing out the previous use of this measure (Peterson and Soderberg 1989) in mean-field simulations. This work has been partially supported by AFOSR contract F49620-90-C-0042, DEE
References

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. 1985. A learning algorithm for Boltzmann machines. Cogn. Sci. 9, 147-169.
Alspector, J., Allen, R. B., Jayakumar, A., Zeppenfeld, T., and Meir, R. 1991. Relaxation networks for large supervised learning problems. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. Moody, and D. Touretzky, eds., p. 1015. Morgan Kaufmann, Palo Alto, CA (Proceedings NIPS '90, Denver, CO).
Edwards, S. F., and Anderson, P. W. 1975. J. Phys. F 5, 965.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671-680.
Peterson, C., and Anderson, J. R. 1987. A mean field theory learning algorithm for neural networks. Complex Syst. 1(5), 995-1019.
Peterson, C., and Soderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1, 3-22.

Received 7 June 1991; accepted 19 August 1991.
Communicated by Graeme Mitchison
What Does the Retina Know about Natural Scenes?

Joseph J. Atick(1)
A. Norman Redlich

School of Natural Sciences, Institute for Advanced Study, Princeton, NJ 08540 USA

(1) Address after July 1, 1992: The Rockefeller University, 1230 York Ave., New York, NY 10021, USA.
By examining the experimental data on the statistical properties of natural scenes together with (retinal) contrast sensitivity data, we arrive at a first-principles theoretical hypothesis for the purpose of retinal processing and its relationship to an animal's environment. We argue that the retinal goal is to transform the visual input as much as possible into a statistically independent basis as the first step in creating a redundancy-reduced representation in the cortex, as suggested by Barlow. The extent of this whitening of the input is limited, however, by the need to suppress input noise. Our explicit theoretical solutions for the retinal filters also show a simple dependence on mean stimulus luminance: they predict an approximate Weber law at low spatial frequencies and a De Vries-Rose law at high frequencies. Assuming that the dominant source of noise is quantum, we generate a family of contrast sensitivity curves as a function of mean luminance. This family is compared to psychophysical data.

1 The Retina and the Visual Environment
An animal must have knowledge of its environment. As Barlow (1989) has emphasized, one important type of knowledge that needs to be stored in the brain is knowledge of the statistical properties of sensory messages. This provides an animal with data about the regular structures or features in its environment. New sensory messages can then be compared to expectations based on this background data; for example, the background data can be subtracted. In this way, one can argue, the brain is able to discover unexpected events and new associations. Here we explicitly explore the possibility that even the retina knows some of the statistical properties of visual messages. Our prejudice is that discovering how this information is used in the retina will not only help explain retinal processing but will be invaluable in applying this idea to the cortex. To discover what the retina knows about the statistics of its environment, it is first necessary to find out just what characterizes the ensemble
of visual messages in a natural environment. An important step in this direction has been taken by Field (1987), who has been analyzing pictures of "natural" scenes, such as landscapes without human-made objects as well as pictures of human faces. As Field has argued, these represent a very small subset of all possible images: all possible arrangements and values of a set of pixels. What he found is that natural images have unique and clearly defined statistical properties. The first statistical measure Field calculated is the two-dimensional spatial autocorrelator

R(x, y) = <L(x)L(y)>   (1.1)

which is defined as the average over many scenes (or the average over one large scene assuming ergodicity) of the product of luminance levels L(x) and L(y) at two spatial points x and y. Actually, by homogeneity of natural scenes the autocorrelator is only a function of the relative distance: R(x - y). One can thus define the spatial power spectrum, which is the Fourier transform of the autocorrelator, R(f) = Int dx e^{if.x} R(x). This is the quantity that Field directly measured. What he found is

R(f) ~ 1/|f|^2
which corresponds to a scale invariant autocorrelator: under a global rescaling of the spatial coordinates x -> ax the autocorrelator R(ax) -> R(x). Although this scale invariant spatial power spectrum is by no means a complete characterization of natural scenes, it is the simplest regularity they possess. The retina, being the first major stage in visual processing, is not expected to have knowledge beyond the simplest aspects of natural scenes, and hence for understanding the retina the power spectrum may be sufficient. The question at this stage is, what is the relationship between this property of the visual environment and the observed visual processing by the retina? To answer this, let us explore what happens to the spatial power spectrum of the visual signal after it is processed by the retina. The output of one major class of retinal ganglion cells(2) is known to be related to the light input approximately through a linear filter:
O(x_j) = Int dx K(x_j - x) L(x) = K . L   (1.2)

where L(x) is the light intensity at point x, O(x_j) is the output of the jth ganglion cell, and K(x_j - x) is the linear ganglion cell kernel (x_j is the center of the cell's receptive field. Here we assume translation invariance of the kernel K, which means that all ganglion cell kernels are the same function, but translated on the retina). Once adapted to bright light, this ganglion cell kernel, in spatial frequency space, is a bandpass filter.

(2) X-cells in cat, P-pathway cells in monkey.
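Field's measurement is easy to reproduce on any digitized scene. A minimal sketch (ours, with arbitrary binning choices) of the radially averaged power spectrum, which for natural images should fall off roughly as 1/|f|^2:

    import numpy as np

    def radial_power_spectrum(image, nbins=40):
        # Radially averaged spatial power spectrum of a 2D float array.
        F = np.fft.fft2(image - image.mean())
        power = np.abs(F) ** 2
        ny, nx = image.shape
        fy = np.fft.fftfreq(ny)[:, None]
        fx = np.fft.fftfreq(nx)[None, :]
        f = np.hypot(fx, fy)
        bins = np.linspace(0, f.max(), nbins + 1)
        which = np.clip(np.digitize(f.ravel(), bins) - 1, 0, nbins - 1)
        counts = np.maximum(np.bincount(which, minlength=nbins), 1)
        mean_f = np.bincount(which, weights=f.ravel(), minlength=nbins) / counts
        mean_p = np.bincount(which, weights=power.ravel(), minlength=nbins) / counts
        return mean_f, mean_p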
Typical retinal filters at high luminosity are shown in Figure 1A and C(3), where the experimental responses K(f) [actually the contrast sensitivity, which is K(f) times the mean luminance I_0] are plotted against stimulus frequency. The data shown in Figure 1A are from De Valois et al. (1974), while the data in Figure 1C are from Kelly (1972). Now to see how the power spectrum is modified by the retina, we need only multiply the input spectrum R(f) by K(f)K*(f), since the average output spectrum is <O(f)O*(f)> = <(K(f)L(f))(K(f)L(f))*>. We can also plot the square root of this output spectrum - the amplitude spectrum - simply by multiplying the experimentally measured kernels K(f) in Figure 1A and C by the input amplitude spectrum

sqrt(R(f)) = |f|^{-1}

This has been done in Figure 1B and D, which shows an intriguing result. At low frequencies, the input spectrum |f|^{-2} is converted into a flat spectrum at the retinal output: <O(f)O*(f)> ~ constant. This whitening of the input by the retina continues up to the frequency where the kernels in Figure 1A and C peak. Had this whitening continued up to the system's cutoff frequency, this would have meant the ganglion cell outputs would be completely decorrelated in space. This is because a white or flat spectrum in frequency space Fourier transforms into a delta function in space, giving <O(x_i)O(x_j)> ~ delta_ij. In other words, the signals on different ganglion cell nerve fibers would be statistically independent. So it appears that the retina is attempting to decorrelate its input, at least down to the scale of the peak frequency. The idea that the brain is attempting to transform its sensory input to a statistically independent basis has been suggested by Goodall (1960) and Barlow (1989) (see also Barlow and Foldiak 1989), and has been discussed by many others. Barlow has emphasized that one advantage of having a statistically independent set of outputs O_i is that all of their joint probabilities P_{ij...} can be obtained directly from knowledge of the relatively small set of individual probabilities P_i. The values of the individual P_i can also be represented by taking the output strengths O_i to be proportional to their improbability, -log(P_i), that is, to the amount of information in each output. This then gives a very compact representation of not only the signals, but also their probabilities.
(3) Actually, what is plotted in Figure 1A and C are the results of psychophysical contrast sensitivity measurements, rather than of single ganglion cell responses. The single-cell results, however, are qualitatively similar, and in this short paper for conciseness we compare theory exclusively to psychophysical results (all figures). In general, we believe that the psychophysical data represent an envelope of the collection of single-cell contrast sensitivities. Then, given our assumption of translation invariance, the psychophysical envelope and the single-cell results should coincide. However, we do not exclude the possibility of a more complicated relationship between psychophysical and single-cell contrast sensitivities.
Figure 1: Retinal filters (A, C) in Fourier space at high mean luminosities, taken from the contrast sensitivity data of De Valois et al. (1974) (A) and Kelly (1972) (C). B (D) is the data in A (C) multiplied by 1/|f|, which is the amplitude spectrum of natural scenes. This gives the retinal ganglion cells' output amplitude spectrum. Notice the whitening of the output at low frequencies. The ordinate units are arbitrary.

In such a statistically independent basis, the outputs O_i represent "features," for example, in English text they would correspond roughly to "words"; they are the statistical structures that carry useful information. Finding these features effectively reduces the redundancy in the original sensory messages, leaving only the so-called "textual" (not predictable) information. One may therefore state this goal of statistical independence in information theory language as a type of redundancy reduction. Based on the experimental evidence in Figure 1B and D, one might advance the hypothesis that the goal of the retinal processing is to produce a decorrelated representation of an image. However, this cannot be the only goal in the presence of input noise such as photon noise or biochemical transduction noise. In that case, decorrelation alone would
be a very dangerous computational strategy, as we now illustrate: If the retina were to whiten all the way up to the cutoff frequency or resolution limit, the kernel K(f) would be proportional to |f| up to that limit. This would imply a constant average squared response KRK^T to "natural" signals L(x), which for R ~ |f|^{-2} have large spatial power at low frequencies and low power at high frequencies. But this same K(f) ~ |f| acting on input noise whose spatial power spectrum is approximately flat (noise is usually already decorrelated) has a very undesirable effect, since it amplifies the noise at high frequencies where noise power, unlike signal power, is not becoming small. Therefore, even if input noise were not a major problem without decorrelation, after complete decorrelation (or whitening up to cutoff) it would become a problem. Also, if both noise and signal are decorrelated at the output, it is no longer possible to distinguish them. Thus, if decorrelation is a strategy, there must be some guarantee that no significant input noise is passed through the retina to the next stage. Further evidence that the retina is concerned about not passing significant amounts of input noise is found in experiments in which the mean stimulus luminance is lowered. In response to this change, the ganglion cell kernel K(f) makes a transition from bandpass to lowpass filtering. This is just the type of transition expected if the kernel is adapting to a lower signal-to-noise ratio, since lowpass filtering is a standard signal processing technique for smoothing away noise. Such a bandpass to lowpass transition also occurs when the temporal modulation frequency of the stimulus is increased (the retinal kernel is actually a function of both the spatial frequency f and the temporal frequency omega, which has up to now been suppressed). In this case too there is an effective decrease in the spatial signal-to-noise ratio, so it is also evidence for noise suppression. In a previous paper (Atick and Redlich 1990) we found an information theoretic formalism that unifies redundancy reduction and noise suppression. That formalism predicts all the qualitative aspects of the experimental data. However, it is highly technical and uses parameters that do not seem to have clear physical roles. This makes it more difficult to do quantitative comparisons with experiments, since the necessary dependence of these parameters on, for example, mean luminance is not intuitive. In this paper we adopt a modular approach where noise suppression and redundancy reduction are done in separate stages. This has two advantages: first it produces parameters with more direct physical meaning, and second it gives a clearer theoretical understanding of the purpose of retinal processing. In the next section we formulate our theory mathematically, making more concrete the heuristic notions of decorrelation and noise suppression. We then derive a simple theoretical retinal transfer function and compare it to experiments.
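Before doing so, the noise-amplification warning above is easy to make concrete numerically. In this illustrative fragment (all numbers are ours), a pure whitening kernel K proportional to |f| leaves the signal power flat but amplifies a flat noise floor as |f|^2:

    import numpy as np

    f = np.logspace(-1, 2, 100)     # spatial frequency
    R = 1.0 / f**2                  # natural-scene signal power, ~ 1/|f|^2
    N2 = 1e-4 * np.ones_like(f)     # flat input-noise power
    K = f                           # whitening all the way to cutoff
    signal_out = K**2 * R           # flat: the whitened signal
    noise_out = K**2 * N2           # grows like f^2: amplified noise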
2 Decorrelation as a Computational Strategy in Retina
2.1 Decorrelation in the Absence of Noise. In the previous section, we gave some experimental evidence leading to the hypothesis that the goal of retinal processing is to produce a representation with reduced redundancy. This implies a representation where the ganglion cell activities are as decorrelated as possible (more generally, statistically independent), given the inherent problem of input noise in the retina. In this section, we formulate this notion as a mathematical theory of the retina. We first set up the decorrelation problem ignoring noise, and later introduce the simple but important modification needed for noise suppression. The outputs {O(x_i)} of the array of ganglion cells are completely decorrelated iff <O(x_i)O(x_j)> ~ delta_ij, where the brackets denote an ensemble average over natural stimuli. In general, due to the presence of noise, the retina will not decorrelate completely. Instead the filter K will only tend to decorrelate (or decorrelate up to a given scale). For this reason it is most natural to formulate the problem in terms of a variational principle with an "energy" or cost functional, E{K}, that grades different kernels according to how well they decorrelate the output. Any constraints on this process are easily incorporated as penalty terms in the energy functional. To find the correct energy functional for decorrelation one may use Wegner's theorem (Bodewig 1956), which states that

det<O(x_i)O(x_j)> <= Prod_i <O^2(x_i)>   (2.1)
with equality if and only if the matrix <O(x_i)O(x_j)> is diagonal. This means that decorrelation can be achieved by keeping det<O(x_i)O(x_j)> fixed and minimizing Prod_i <O^2(x_i)>. One reason for keeping det<O(x_i)O(x_j)> = det(KRK^T) fixed is that this ensures a reversible transformation, since it is the same as requiring det(K^T K) > 0. [Here we are treating the kernel as a matrix K_ij = K(x_i - x_j).] Actually, there are a couple of mathematical steps that lead to a simpler energy functional. First, with the assumption of translation invariance we can minimize <O^2(x_0)> for one ganglion cell at location x_0 instead of Prod_i <O^2(x_i)>. Again by translation invariance, this is equivalent to minimizing the explicitly invariant expression Sum_i <O^2(x_i)> = Tr(KRK^T). Finally, it is more convenient to hold fixed log det(K^T K) rather than det(K^T K). Thus(4)
E{K} = Tr(KRK^T) - rho log det(K^T K)   (2.2)

rho is a Lagrange multiplier used to fix det(K^T K) to some value, but since we do not know this value we will subsequently treat rho as a parameter penalizing small det(K^T K).

(4) We should point out that the decorrelating filter K that minimizes 2.2 is not the usual Karhunen-Loeve transform, which would be the Fourier transform for translationally invariant R. This KL transform gives a nonlocal, nontranslationally invariant K.
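The content of equation 2.2 can be checked numerically. In this sketch (the sizes, step size, and synthetic covariance R are our own choices), gradient descent on E{K} drives the output correlation KRK^T to rho times the identity, that is, to a decorrelated output:

    import numpy as np

    rng = np.random.default_rng(0)
    n, rho, step = 8, 1.0, 0.02
    A = rng.normal(size=(n, n))
    R = A @ A.T / n + np.eye(n)       # a synthetic input covariance
    K = np.eye(n)
    for _ in range(4000):
        grad = 2 * K @ R - 2 * rho * np.linalg.inv(K).T   # dE/dK for eq. 2.2
        K -= step * grad
    # At the minimum K R K^T = rho * I: the outputs are decorrelated.
    print(np.round(K @ R @ K.T, 2))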
To find the kernel K that minimizes equation 2.2, it is best to work in frequency space, where traces such as Tr(KRK^T) become integrals over frequencies. Also, the second term in equation 2.2 can be converted to an integral, by first using the matrix identity log det(K^T K) = Tr log(K^T K). The equivalent energy functional becomes

E{K} = Int df |K(f)|^2 R(f) - rho Int df log |K(f)|^2   (2.3)

which when varied with respect to K(f) gives

|K(f)| = [rho / R(f)]^{1/2}   (2.4)
With Field's R(f) ~ 1/|f|^2, this gives the whitening filter K(f) proportional to sqrt(rho) |f|. Having arrived at the energy functional [equation 2.2 (or 2.3)] as the one that produces decorrelation, it is now straightforward to explain its information theoretic interpretation. Minimizing the first term in equation 2.2 is equivalent (see Atick and Redlich 1990) to minimizing the sum of bit entropies Sum_i H_i = -Sum_i Int dO_i P(O_i) log[P(O_i)], where P(O_i) is the probability density for the ith ganglion cell output O_i = O(x_i). The second term in equation 2.2 is the change in entropy H (including correlations, not just bit entropy) due to the retinal transformation, so requiring this term to vanish would impose the constraint that no information is lost - this is related to requiring reversibility, although it is stronger. Therefore minimizing E in equation 2.2 has the effect of reducing the ratio of bit entropy to true entropy: Sum_i H_i / H, which is what we mean here by redundancy. Minimizing this ratio reduces the number of bits carrying the information H; technically, it reduces all but the first-order redundancy. Also, one can prove that Sum_i H_i >= H, with equality only when the O(x_i) are statistically independent, so minimizing this ratio produces statistically independent outputs.

2.2 Introducing the Noise. Since here we are primarily interested in testing redundancy reduction, we take a somewhat simplified approach to the problem with noise. As discussed earlier, instead of doing a full-fledged information theoretic analysis (as in Atick and Redlich 1990), we work in a formalism where the signal is first low-pass filtered to eliminate noise. The resulting signal is then decorrelated as before. Actually, since we will be comparing with real data, we have now to be more explicit about the stages of processing that we believe precede the decorrelation stage. In Figure 2 we show a schematic of the signal processing stages that we assume take place in the retina. First, images from natural scenes pass through the optical medium of the eye and in doing so their image quality is lowered. It is well known that this effect can be taken
Figure 2: Schematic of the signal processing stages assumed to take place in the retina: the optical MTF, followed by a low-pass filter, followed by the whitening kernel K.
into account by multiplying the images by the optical modulation transfer function or MTF of the eye, a function of spatial frequency that is measurable in purely non-neural experiments. In fact, an exponential of the form exp[-(|f|/f_c)^alpha], for some scale f_c characteristic of the animal (in primates f_c ~ 22 c/deg and alpha ~ 1.4), is a good approximation to the optical MTF. The resulting image is then transduced by the photoreceptors and is low-pass filtered to eliminate input noise. Finally, we assume that it is decorrelated. In this model, the output-input relation takes the form

O = K . [M . (L + n) + n_0]   (2.5)

where the dot denotes a convolution as defined in equation 1.2. n(x) is the input noise (such as quantum noise) while n_0(x_i) is some intrinsic noise that models postreceptor synaptic noise. Finally, M is the filter that takes into account both the optical MTF as well as the low-pass filtering needed to eliminate noise. An explicit expression for M will be derived below. With this model, the energy functional determining the decorrelation filter K is

E{K} = Int df |K(f)|^2 {M^2(f)[R(f) + N^2(f)] + N_0^2(f)} - rho Int df log |K(f)|^2   (2.6)

where N^2(f) = <|n(f)|^2> and N_0^2(f) = <|n_0(f)|^2> are the input and synaptic noise powers, respectively. This energy functional is the same as that in equation 2.3 but with the variance R(f) replaced by the output variance of O in equation 2.5.
As before, the variational equations dE/dK = 0 are easy to solve for K. The experimentally measured filter K_exp is then this variational solution, K, times the filter M:

|K_exp(f)| = M(f) sqrt(rho) / {M^2(f)[R(f) + N^2(f)] + N_0^2}^{1/2}   (2.7)

An identical result can be obtained in space-time trivially by replacing the autocorrelator R(f) and the filter M(f) by their space-time analogs R(f, omega) and M(f, omega), respectively, with omega the temporal frequency. However, we focus here on the purely spatial problem, where we have Field's (1987) measurement of the spatial autocorrelator R(f) of natural scenes: R(f) = C/|f|^2.

2.3 Deriving the Low-Pass Filter. In our explicit expression for K_exp, below, we shall use the following low-pass filter

M(f) = [R(f) / (R(f) + N^2(f))]^{1/2} exp[-(|f|/f_c)^alpha]   (2.8)
The exponential term is the optical MTF while the first term is a low-pass filter that we derive next. The reader who is not interested in the details of the derivation can skip this section without loss of continuity.

It is not clear in the retina what principle dictates the choice of the low-pass filter or how much of the details of the low-pass filter influence the final result. In the absence of any strong experimental hints, of the type that imply redundancy reduction, we shall try a simple information theoretic principle to derive an M: We will insist that the filter M should be chosen such that the filtered signal O' = M . (L + n) carries as much information as possible about the ideal signal L, subject to some constraint. To be more explicit, the amount of information carried by O' about L is the mutual information I(O', L). However, as is well known (for L and n statistically independent gaussian variables, see Shannon and Weaver 1949), I(O', L) = [H(O') - Noise Entropy], and thus if we maximize I(O', L) keeping fixed the entropy H(O') we achieve a form of noise suppression. We can now formulate this as a variational principle. To simplify the calculation we assume gaussian statistics for all the stochastic variables involved. The output-input relation including quantization units, n_q, takes the form O' = M . (L + n) + n_q. A standard calculation leads to

I(O', L) = (1/2) Int df log{1 + M^2(f)R(f) / [M^2(f)N^2(f) + N_q^2]}

Similarly, one finds for the entropy H(O') = Int df log[M^2(f)(R(f) + N^2(f)) + N_q^2]. The variational functional or energy for smoothing can then be written as E{M} = -I(O', L) - eta H(O'). It is not difficult to show that the optimal noise suppressing solution dE/dM = 0 takes the form

M^2(f) = (N_q^2 / N^2) { (1/2 eta) [R(f) / (R(f) + N^2)] - 1 }

with the parameter eta ~ I_0 in order to hold H(O') fixed with mean luminance. Actually, below we will be working in the regime where the quantization units are much smaller than the signal and noise powers, and hence we can safely drop the -1 term in M since the 1/eta term dominates for small eta. We can also ignore any overall factors in M that are independent of f. This then is the form that we exhibit in the first term in equation 2.8.
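Putting 2.7 and 2.8 together, the full theoretical filter is a few lines of code. The following is an illustrative evaluation under the reconstructed equations above; the parameter values are made up, except f_c and alpha, which are the primate values quoted earlier:

    import numpy as np

    def retinal_filter(f, I0, C=1.0, rho=1.0, Nprime=1.0, N0sq=1.0,
                       f_c=22.0, alpha=1.4):
        R = C * I0**2 / f**2                    # R(f), with R ~ I0^2 (Field)
        N2 = I0 * Nprime                        # quantum input noise, N^2 = I0 N'
        M = np.sqrt(R / (R + N2)) * np.exp(-(f / f_c)**alpha)        # eq. 2.8
        return M * np.sqrt(rho) / np.sqrt(M**2 * (R + N2) + N0sq)    # eq. 2.7

    f = np.logspace(-1, 2, 200)                 # spatial frequency, c/deg
    family = {I0: I0 * retinal_filter(f, I0) for I0 in (1.0, 10.0, 100.0)}
    # family[I0] is the contrast sensitivity I0 * K_exp at mean luminance I0.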
2.4 Analyzing the Solution. Let us now analyze the form of the complete solution 2.7, with M given in equation 2.8. In Figure 3 we have plotted K_exp(f) (curve a) for a typical set of parameters. We have also plotted the filter without noise R(f)^{-1/2} (equation 2.4) (curve b) and M(f) (equation 2.8) (curve c). There are two points to note: at low frequency the kernel K_exp(f) (curve a) is identically performing decorrelation, and thus its shape in that regime is completely determined by the statistics of natural scenes: the physiological functions M and N drop out. At high frequencies, on the other hand, the kernel coincides with the function M, and the power spectrum of natural scenes R drops out. We can also study the behavior of the kernel in equation 2.7 as a function of mean luminosity I_0. If one assumes that the dominant source of noise is quantum noise, then the dependence of the noise parameter on I_0 is simply N^2 = I_0 N', where N' is a constant independent of I_0 and independent of frequency (flat spectrum). This gives an interesting result. At low frequency, where K_exp goes like 1/sqrt(R), its I_0 dependence will be K_exp ~ 1/I_0 (recall R ~ I_0^2), and the system exhibits a Weber law behavior, that is, its contrast sensitivity I_0 K_exp is independent of I_0. While in the other regime - at high frequency - where the kernel asymptotes M with N^2 > R, then K_exp ~ 1/I_0^{1/2}, which is a De Vries-Rose behavior, I_0 K_exp ~ I_0^{1/2}. This predicted transition from Weber to De Vries-Rose with increasing frequency is in agreement with what is generally found (see Kelly 1972, Fig. 3). Given the explicit expression in equation 2.7 and the choice of quantum noise for N^2, we can generate a set of kernels as a function of I_0. The resulting family is shown for primates in Figure 4. We need to emphasize that there are no free parameters here which depend on I_0. The only variables that needed to be fixed were the numbers f_c, alpha, rho, and N', and they are independent of I_0. Also we work in units of synaptic noise n_0, so the synaptic noise power N_0^2 is set to one. We have superimposed on this family the data from the experiments of Van Ness and Bouman (1967) on human psychophysical contrast sensitivity. It does not take
much imagination to see that the agreement is very reasonable, especially keeping in mind that this is not a fit but a parameter-free prediction.

Figure 3: Curve a is the predicted retinal filter from equation 2.7 for a typical set of parameters, while curve b is R(f)^{-1/2}, which is the pure whitening filter. Finally, curve c is the low-pass filter M. The figure shows that at low frequencies curves a and b coincide and thus the system is whitening, while at high frequencies curves a and c coincide and thus the retinal filter is determined by the low-pass filter.

3 Discussion
One major aim of this paper has been to answer the question, what does the retina know about its visual environment? Our initial answer comes from noting that the experimental ganglion cell kernel whitens the spatial power spectrum of natural scenes found in completely independent experiments by Field (1987). This shows that the retinal code has been optimized - assuming whitening as a design principle - for an environment with a |f|^{-2} spectrum. In other words, the retina knows at least one statistical property of natural scenes: the spatial autocorrelator.
Figure 4: The family of solid curves are the predicted retinal filters (equation 2.7) at different I_0, separated by one log unit, assuming that the dominant source of input noise is quantum noise (N^2 ~ I_0). No other parameters depend on I_0. The fixed parameters are f_c = 22 c/deg, alpha = 1.4, rho = 2.7 x 10^5, N' = 1.0. The data are from human psychophysical contrast sensitivity measurements of Van Ness and Bouman (1967).

But what is useful about whitening the input signal? One possible answer is that whitening compresses the (photoreceptor) input signal so that it can fit into a channel with a more limited dynamical range, or capacity. Such a limitation may be a physical one in the retina, such as at the bipolar cell input synapses, or it may be in the ganglion cell output cable, the optic nerve (see also Srinivasan et al. 1982). Another possible explanation for the whitening is Barlow's idea that a statistically independent, or redundancy reduced, representation is desirable as a cortical strategy for processing sensory data. From this point of view, the retinal filter is only performing the first step in reducing redundancy, by reducing second-order statistics (correlation). With this explanation, the capacity limitation is located further back in the brain, and may be best understood as an effective capacity limit, which is due to a computational
bottleneck, for example, the attentional bottleneck of ~40 bits/sec. Of course, since redundancy reduction usually allows compression of a signal, there is no reason both explanations for whitening - physical bottleneck in the retina or computational bottleneck in cortex - must be mutually exclusive. Also, to paraphrase Linsker (1989), the brain may create physiological capacity limitations at one stage in order to force an encoding whose true utility is in its use as part of a larger strategy, such as Barlow's redundancy reduction.
-
an explanation of the relatively low peak frequency of the retinal filter in bright light. It also leads to the prediction of a bandpass to lowpass transition with decreasing mean stimulus luminance. In fact, our solutions predict an approximately Weber behavior at low frequencies, and assuming quantum noise, an approximately De Vries-Rose behavior at high frequencies. The same property of our solutions that leads to the observed behavior with changing luminance also explains another set of experiments: a similar bandpass to lowpass transition is observed when the temporal frequency of the stimulus is increased. That is, the effect of lowering I_0 is predicted to be very close to the effect of raising temporal frequency. A more complicated relationship between color processing and changes in stimulus frequency is also predicted by our theory, as is the cone to rod transition. So a very large class of experimental observations can all be explained as the consequence of a single principle. They also, as mentioned, probe more specific properties of an animal's environment, so they further test the dependence of retinal processing on environment. All of these space-time-color-luminance interactions are explored in a separate paper (Atick et al. 1992).

Acknowledgment

Work supported in part by a grant from the Seaver Institute.

References

Atick, J. J., and Redlich, A. N. 1990. Towards a theory of early visual processing. Neural Comp. 2, 308-320; and 1990. Quantitative tests of a theory of retinal processing: Contrast sensitivity curves. Report No. IASSNS-HEP-90/51.
Atick, J. J., Li, Z., and Redlich, A. N. 1992. Understanding retinal color coding from first principles. To appear in Neural Comp. 1992.
Barlow, H. B. 1989. Unsupervised learning. Neural Comp. 1, 295-311.
Barlow, H. B., and Foldiak, P. 1989. The Computing Neuron. Addison-Wesley, New York.
Bodewig, E. 1956. Matrix Calculus. North-Holland, Amsterdam.
Derrington, A. M., and Lennie, P. 1982. The influence of temporal frequency and adaptation level on receptive field organization of retinal ganglion cells in cat. J. Physiol. 333, 343-366.
De Valois, R. L., Morgan, H., and Snodderly, D. M. 1974. Psychophysical studies of monkey vision-III. Spatial luminance contrast sensitivity tests of macaque and human observers. Vision Res. 14, 75-81.
Field, D. J. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4, 2379-2394.
Goodall, M. C. 1960. Performance of a stochastic net. Nature (London) 185, 557-558.
Hubel, D. H., and Wiesel, T. N. 1974. Sequence regularity and geometry of orientation columns in the monkey striate cortex. J. Comp. Neurol. 158, 267-294.
Kelly, D. H. 1972. Adaptation effects on spatio-temporal sine-wave thresholds. Vision Res. 12, 89-101.
Linsker, R. 1989. An application of the principle of maximum information preservation to linear systems. In Advances in Neural Information Processing Systems, Vol. 1, D. S. Touretzky, ed., pp. 186-194. Morgan Kaufmann, San Mateo, CA.
Shannon, C. E., and Weaver, W. 1949. The Mathematical Theory of Communication. The University of Illinois Press, Urbana.
Srinivasan, M. V., Laughlin, S. B., and Dubs, A. 1982. Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. London Ser. B 216, 427-459.
Van Ness, F. L., and Bouman, M. A. 1967. Spatial modulation transfer in the human eye. J. Opt. Soc. Am. 57, 401-406.
Received 15 July 1991; accepted 3 October 1991.
54. David J. Field . 1994. What Is the Goal of Sensory Coding?What Is the Goal of Sensory Coding?. Neural Computation 6:4, 559-601. [Abstract] [PDF] [PDF Plus] 55. Zhaoping Li , Joseph J. Atick . 1994. Toward a Theory of the Striate CortexToward a Theory of the Striate Cortex. Neural Computation 6:1, 127-146. [Abstract] [PDF] [PDF Plus] 56. Dawn M. Adelsberger-Mangan, William B. Levy. 1993. Adaptive synaptogenesis constructs networks that maintain information and reduce statistical dependence. Biological Cybernetics 70:1, 81-87. [CrossRef] 57. A. Norman Redlich . 1993. Supervised Factorial LearningSupervised Factorial Learning. Neural Computation 5:5, 750-766. [Abstract] [PDF] [PDF Plus] 58. A. Norman Redlich . 1993. Redundancy Reduction as a Strategy for Unsupervised LearningRedundancy Reduction as a Strategy for Unsupervised Learning. Neural Computation 5:2, 289-304. [Abstract] [PDF] [PDF Plus] 59. Joseph J. Atick , A. Norman Redlich . 1993. Convergent Algorithm for Sensory Receptive Field DevelopmentConvergent Algorithm for Sensory Receptive Field Development. Neural Computation 5:1, 45-60. [Abstract] [PDF] [PDF Plus] 60. Joseph J. Atick , Zhaoping Li , A. Norman Redlich . 1992. Understanding Retinal Color Coding from First PrinciplesUnderstanding Retinal Color Coding from First Principles. Neural Computation 4:4, 559-572. [Abstract] [PDF] [PDF Plus] 61. Robert B. Pinter, Abdesselem Bouzerdoum, Bahram NabetCybernetics . [CrossRef]
Communicated by Haim Sompolinsky
A Simple Network Showing Burst Synchronization without Frequency Locking Christof Koch Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125 USA
Heinz Schuster Institut für theoretische Physik, Universität Kiel, Olshausenstrasse 40, 2300 Kiel 1, Germany
Neural Computation 4, 211-223 (1992)
@ 1992 Massachusetts Institute of Technology
The dynamic behavior of a network model consisting of all-to-all excitatory coupled binary neurons with global inhibition is studied analytically and numerically. We prove that for random input signals, the output of the network consists of synchronized bursts with apparently random intermissions of noisy activity. We introduce the fraction of simultaneously firing neurons as a measure of synchrony and prove that its temporal correlation function displays, besides a delta peak at zero indicating random processes, strongly damped oscillations. Our results suggest that synchronous bursts can be generated by a simple neuronal architecture that amplifies incoming coincident signals. This synchronization process is accompanied by damped oscillations that, by themselves, however, do not play any constructive role in it and can therefore be considered an epiphenomenon.

1 Introduction
Recently synchronization phenomena in neural networks have attracted considerable attention. This was mainly due to two experimental observations. First, Gray et al. (1989), Engel et al. (1990), as well as Eckhorn et al. (1988) (see also Freeman 1978; Wilson and Bower 1991) provided electrophysiological evidence that neurons in the visual cortex of cats discharge in a semisynchronous, oscillatory manner in the 40 Hz range and that the firing activity of neurons up to 10 mm away is phase-locked with a mean phase shift of less than 3 msec. It has been proposed that this phase synchronization can solve the binding problem for figure-ground segregation (von der Malsburg and Schneider 1986) and underlie visual attention and awareness (Crick and Koch 1990). Second, synchronous bursts converging on a postsynaptic target cell will produce large depolarizations that are optimal for activating NMDA receptors, leading to
long-term potentiation (Brown et al. 1990). This suggests the possibility that the induction of plasticity requires temporal synchronization of synaptic input. A number of theoretical explanations based on coupled (relaxation) oscillator models have been proposed for burst synchronization (Sompolinsky et al. 1989; Kammen et al. 1990). The crucial issue of phase synchronization has also recently been addressed by Bush and Douglas (1991), who simulated the dynamics of a network consisting of bursty, layer V pyramidal cells coupled to a common pool of basket cells inhibiting all pyramidal cells.¹ The cells were modeled using Hodgkin and Huxley-like dynamics. Bush and Douglas found that excitatory interactions between the pyramidal cells increase the total neural activity, as expected, and that global inhibition leads to synchronized bursts with random intermissions. These population bursts appear to occur in a random manner in their model. The basic mechanism for the observed burst synchronization is hidden in the numerous anatomical and biophysical details of their model. These, and the related observation that to date no strong oscillations have been recorded in the neuronal activity in visual cortex of awake monkeys, prompted us to investigate how phase synchronization can occur in the absence of frequency locking. We proceed by replacing the cortical architecture of Bush and Douglas (1991) by a simple, exactly solvable model of all-to-all, excitatory coupled binary McCulloch-Pitts neurons (1943) that are globally connected to one inhibitor that we simulate by an activity-dependent common threshold. We find that for random uncorrelated inputs the output of the network consists of synchronized bursts with seemingly random intermissions. This shows that burst synchronization is a generic feature of such a neuronal architecture, which amplifies incoming coincident signals to synchronous bursts. Whenever several input signals coincide, they excite the network to a global burst of activity that is subsequently shut down by the inhibition. The minimal number of coincidences needed to trigger collective bursting increases with increasing $\theta/w$, where $\theta$ is the threshold of the neurons and $w$ measures the strength of the excitatory coupling. For $\theta/w \to 0$ the interburst interval decreases until one sees only a regular sequence of global bursts, each followed immediately by zero activity. Therefore, the output of the network varies from essentially randomly separated synchronous bursts (for $\theta/w < 1$) to regular series of on-off activity (for $\theta/w \to 0$). To substantiate these statements, we analyze the fraction $m$ of synchronously firing neurons as a function of the random input activity. We show that the autocorrelation of the neuronal activity $m$ displays, in addition to a peak at zero time indicating random bursts, a tail that decays exponentially in an oscillatory fashion. The origin of this
damped oscillation can be traced back to the global inhibitory feedback. For $\theta/w \approx 1$, these oscillations are very strongly damped; nevertheless, the network displays aperiodic synchronous bursting whose interburst intervals are independent of the oscillatory period. This means that these oscillations play no constructive role for burst synchronization.

¹This model bears similarities to Wilson and Bower's (1991) model describing the origin of phase locking in olfactory cortex.

2 A Coincidence Network
We consider $n$ excitatory coupled binary McCulloch-Pitts (1943) neurons whose output $x_i^{t+1} \in \{0,1\}$ at time $t+1$ is given by
$$ x_i^{t+1} = \sigma\left[\frac{w}{n}\sum_j x_j^t + \xi_i^t - \theta\right] \qquad (2.1) $$
Here $w/n > 0$ is the normalized excitatory all-to-all synaptic coupling, $\xi_i^t$ is the external input to neuron $i$ at time $t$, and $\sigma[x] = 1$ for $x > 0$ and 0 elsewhere. Each neuron has the same dynamic threshold $\theta > 0$. Next we introduce the fraction $m^t$ of neurons that fire simultaneously at time $t$:
$$ m^t = \frac{1}{n}\sum_i x_i^t \qquad (2.2) $$
In general, $0 \le m^t \le 1$; only if every neuron is active at time $t$ do we have $m^t = 1$. By summing equation 2.1 we then obtain the following equation of motion for our simple network:
$$ m^{t+1} = \frac{1}{n}\sum_i \sigma\left[w m^t + \xi_i^t - \theta\right] \qquad (2.3) $$
The behavior of this finite-state automaton (it can take on all the $n+1$ states characterized by $m^t = i/n$, with $0 \le i \le n$) is then fully described by the phase-state diagram of Figure 1. If $\theta > 1$ and $\theta/w > 1$, then the output of the network $m^t$ will vary with the input until at some time $t'$, $m^{t'} = 0$. Since the threshold $\theta$ is now always larger than the input, the network will remain in this state, that is, $m^{t''} = 0$ for all $t'' > t'$. If, on the other hand, the threshold $\theta < 1$ and smaller than the weight, that is, $\theta/w < 1$, the network will drift until it comes to the state $m^{t'} = 1$. Since from then on $w m^t$ is at all times larger than the threshold, the network remains latched at $m^t = 1$. If $\theta > 1$ but $\theta/w < 1$, the network can latch in either the $m^t = 0$ or the $m^t = 1$ state and will remain there indefinitely. Lastly, if $\theta < 1$ but $\theta/w > 1$, neither state latches: the coupling alone cannot exceed the threshold, and single inputs suffice to exceed it. If we define the normalized input activity or noise
$$ s^t = \frac{1}{n}\sum_i \xi_i^t \qquad (2.4) $$
Figure 1: Phase diagram for the network described by equation 2.3. Different regions correspond to different stationary output states $m^t$ in the long time limit. For details see text.

with $0 \le s^t \le 1$, we see that in this part of phase space $m^{t+1} = s^t$; in other words, the total output activity faithfully reflects the input activity at the previous time step. Let us now increase the behavioral repertoire of our network by introducing an adaptive time-dependent threshold, $\theta^t$, motivated by the use of global inhibition in Bush and Douglas (1991). We assume that $\theta^t$ remains at its value $\theta < 1$ as long as the total activity remains less than 1. If, however, $m^t = 1$, we increase $\theta^t$ to a value larger than $w + 1$. This has the effect of resetting the activity of the entire network to 0 in the next time step, that is, $m^{t+1} = (1/n)\sum_i \sigma[w + \xi_i^t - (w + 1 + \epsilon)] = 0$. The threshold will then automatically reset itself to its old value. In other words, we will now consider the case of
$$ m^{t+1} = \frac{1}{n}\sum_i \sigma\left[w m^t + \xi_i^t - \theta(m^t)\right] \qquad (2.5) $$
with
$$ \theta(m^t) = \begin{cases} \theta & \text{for } m^t < 1 \\ w + 1 + \epsilon & \text{for } m^t = 1 \end{cases} $$
Therefore, we are operating in the topmost left part of Figure 1 but preventing the network from latching to $m^t = 1$ by resetting it. Such a dynamic threshold bears some similarities to the models of Horn and Usher (1989, 1990), Treves and Amit (1989), and others, but is much simpler. Note that $\theta(m^t)$ exactly mimics the effect of a common inhibitory neuron that is excited only if all neurons fire simultaneously. Our network now acts as a coincidence detector, such that all neurons will "fire" at time $t+2$, that is, $x_i^{t+2} = 1$ for all $i$, if at least $k$ neurons receive at time $t$ a "1" as input; $k$ is the smallest integer with $k > \theta n / w$. If the network receives at least $k$ such inputs, the network will react to this two time steps later by discharging all neurons, with $m^{t+2} = 1$. The threshold $\theta(m^t)$ is then transiently increased, the network is reset, and the game begins anew. In other words, the network detects coincidences and signals this by a synchronized burst of neuronal activity followed by a brief respite of activity. Figure 2 shows the typical behavior of our network. The time dependence of $m^t$ given by equation 2.5 can be written as
$$ m^{t+1} = \begin{cases} s^t & \text{for } 0 \le m^t < \theta/w \\ 1 & \text{for } \theta/w \le m^t < 1 \\ 0 & \text{for } m^t = 1 \end{cases} \qquad (2.6) $$
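As a quick concreteness check, the reduced dynamics of equation 2.6 is easy to simulate directly. The minimal sketch below is in Python rather than the MATHEMATICA used for our figures; $n$ and $\theta/w$ follow Figure 2, while the Bernoulli input probability $p$ is an illustrative choice not fixed by the text.

```python
import random

def simulate(n=20, theta_over_w=0.225, p=0.15, steps=200, seed=0):
    """Iterate the reduced dynamics of equation 2.6:
    m -> s  for 0 <= m < theta/w   (output follows the input)
    m -> 1  for theta/w <= m < 1   (a coincidence triggers a global burst)
    m -> 0  for m == 1             (the dynamic threshold resets the net)
    """
    rng = random.Random(seed)
    m, trace = 0.0, []
    for _ in range(steps):
        s = sum(rng.random() < p for _ in range(n)) / n  # input fraction s^t
        if m == 1.0:
            m = 0.0
        elif m >= theta_over_w:
            m = 1.0
        else:
            m = s
        trace.append(m)
    return trace

trace = simulate()
print(sum(m == 1.0 for m in trace), "global bursts in", len(trace), "steps")
```

With such parameters one observes the phenomenology of Figure 2: every burst is followed by one step of zero activity, and the interburst intervals look random.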
By introducing functions $A(m)$, $B(m)$, $C(m)$, which take on the value 1 in the intervals specified for $m = m^t$ in equation 2.6, respectively, and zero elsewhere (see also Fig. 3), we find that $m^{t+1}$ can be written as
$$ m^{t+1} = s^t A(m^t) + 1 \cdot B(m^t) + 0 \cdot C(m^t) \qquad (2.7) $$
This equation can be iterated, yielding an explicit expression for $m^t$ as a function of the external inputs $s^{t-1}, \ldots, s^0$ and the initial value $m^0$:
$$ m^t = (s^{t-1}, 1, 0)\, M(s^{t-2}) \cdots M(s^0) \begin{pmatrix} A(m^0) \\ B(m^0) \\ C(m^0) \end{pmatrix} \qquad (2.8) $$
with the matrix
$$ M(s) = \begin{pmatrix} A(s) & 0 & 1 \\ B(s) & 0 & 0 \\ C(s) & 1 & 0 \end{pmatrix} $$
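The closed form can be checked mechanically against the step-by-step iteration of equation 2.7. The following sketch (pure Python; $\theta/w = 0.225$ is chosen only for illustration) asserts that both routes give the same $m^t$.

```python
import random

theta_over_w = 0.225

def A(m): return 1.0 if 0.0 <= m < theta_over_w else 0.0
def B(m): return 1.0 if theta_over_w <= m < 1.0 else 0.0
def C(m): return 1.0 if m == 1.0 else 0.0

def M(s):  # transformation matrix of equation 2.8
    return [[A(s), 0.0, 1.0],
            [B(s), 0.0, 0.0],
            [C(s), 1.0, 0.0]]

def matvec(mat, v):
    return [sum(mat[i][j] * v[j] for j in range(3)) for i in range(3)]

def m_direct(s_seq, m0):
    """Iterate equation 2.7 one step at a time."""
    m = m0
    for s in s_seq:
        m = s * A(m) + 1.0 * B(m) + 0.0 * C(m)
    return m

def m_closed_form(s_seq, m0):
    """Evaluate equation 2.8: a row vector times t-1 matrices times the start vector."""
    v = [A(m0), B(m0), C(m0)]
    for s in s_seq[:-1]:              # apply M(s^0), ..., M(s^{t-2}) in order
        v = matvec(M(s), v)
    return s_seq[-1] * v[0] + 1.0 * v[1] + 0.0 * v[2]

rng = random.Random(1)
s_seq = [round(rng.random(), 3) for _ in range(50)]
assert abs(m_direct(s_seq, 0.0) - m_closed_form(s_seq, 0.0)) < 1e-12
```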
Equation 2.8 is the principal result of this section. It shows that the dynamics of our network model can be solved explicitly, by iteratively applying the transformation matrix $M$ $t-1$ times to the initial network configuration.

3 Distribution of Bursts and Time Correlations
We have seen in the previous section, in equation 2.8, that the synchronous activity at time $t$ depends on the specific realization of the input signals
Figure 2: Time dependence of the fraction $m^t = (1/n)\sum_i x_i^t$ of output neurons that fire simultaneously, compared to the corresponding fraction of input signals $s^t = (1/n)\sum_i \xi_i^t$, for $n = 20$ and $\theta/w = 0.225$. The input variables are independently distributed according to $P(\xi_i^t) = p\,\delta(\xi_i^t - 1) + (1-p)\,\delta(\xi_i^t)$. Values of $s^t$ larger than 0.225 (dotted line; this corresponds to at least five input signals with $\xi_i^t = 1$) lead to the entire population firing in synchrony two time steps later, that is, $m^{t+2} = 1$. Note the "random" appearance of the interburst intervals. All simulations were carried out using MATHEMATICA.
at different times. To get rid of this ambiguity we have to resort to averaged quantities, where averages are understood over the distribution $\tilde P\{s^t\}$ of inputs $s^t = (1/n)\sum_{i=1}^n \xi_i^t$. A very useful averaged quantity is the probability $P^t(m)$, describing the fraction $m$ of simultaneously firing
Figure 3: Regions on the $m$ axis where the functions $A(m)$, $B(m)$, and $C(m)$ have value 1. Outside these regions all these functions are zero.

neurons at time $t$. $P^t(m)$ is related to the probability distribution $\tilde P\{s^t\}$ via
$$ P^t(m) = \left\langle \delta\left[m - m^t\{s^{t-1}, \ldots, s^0\}\right] \right\rangle \qquad (3.1) $$
where $\langle \cdots \rangle$ denotes the average with respect to $\tilde P\{s^t\}$ and $m^t\{s^{t-1}, \ldots, s^0\}$ is given by equation 2.8. If the input signals $\xi_i^t$ are uncorrelated in time, $m^{t+1}$ depends according to equation 2.7 only on $m^t$, and the time evolution of $P^t(m)$ is described by the Chapman-Kolmogorov equation
$$ P^{t+1}(m) = \int_0^1 dm'\, K(m \mid m')\, P^t(m') \qquad (3.2) $$
with the integral kernel
$$ K(m \mid m') = \int_0^1 ds\, \tilde P(s)\, \delta\left[m - s A(m') - B(m') - 0 \cdot C(m')\right] = \tilde P(m) A(m') + \delta(m-1) B(m') + \delta(m) C(m') \qquad (3.3) $$
Iteration of equations 3.2 and 3.3 yields
$$ P^t(m) = \left[\tilde P(m), \delta(m-1), \delta(m)\right] \bar{M}^{\,t-1}\, \mathbf{v} \qquad (3.4) $$
where
$$ \bar{M} = \begin{pmatrix} 1-\eta & 0 & 1 \\ \eta & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \qquad (3.5) $$
and
$$ \eta = \int_0^1 \tilde P(s)\, B(s)\, ds = \int_{\theta/w}^{1} \tilde P(s)\, ds $$
Notice that $0 \le \eta \le 1$ holds. Here we used the facts that the distribution $\tilde P(s)$ is normalized to unity, $\int_0^1 \tilde P(s)[A(s) + B(s) + C(s)]\, ds = 1$, and that $C(m)$ is 1 only at the point $m = 1$, that is, $\int_0^1 ds\, \tilde P(s) C(s) = 0$. The starting vector $\mathbf{v}$ is related to the initial distribution $P^0(m)$ via $\mathbf{v} = [\int_0^1 dm\, A(m) P^0(m), \int_0^1 dm\, B(m) P^0(m), \int_0^1 dm\, C(m) P^0(m)]$. Equations 3.4 and 3.5 can be solved in terms of the eigenvalues and eigenvectors of $\bar M$ and we find
$$ P^t(m) = P^\infty(m) + \left[P^0(m) - P^\infty(m)\right] f(t) \qquad (3.6) $$
where
$$ P^\infty(m) = \frac{1}{1 + 2\eta}\left[\tilde P(m) + \eta\, \delta(m-1) + \eta\, \delta(m)\right] \qquad (3.7) $$
is the limiting distribution that evolves from the initial distribution $P^0(m)$ for large times, because the factor $f(t) = \eta^{t/2} \cos(\Omega t)$, where
$$ \Omega = \pi - \arctan\left[\frac{\sqrt{4\eta - \eta^2}}{\eta}\right] $$
decays exponentially with time. Equations 3.6 and 3.7 show that the limiting equilibrium distribution $P^\infty(m)$ evolves from the initial distribution $P^0(m)$ in an oscillatory fashion, with the building up of two delta functions at $m = 1$ and $m = 0$ at the expense of $\tilde P(m)$. This signals the emergence of synchronous bursts, that is, $m^t = 1$, which are always followed at the next time step by zero activity, that is, $m^{t+1} = 0$ (see also Fig. 2). The mean fraction $\langle m^t \rangle = \int_0^1 dm\, P^t(m)\, m$ of synchronized neurons evolves as
$$ \langle m^t \rangle = \langle m^\infty \rangle + \left[\langle m^0 \rangle - \langle m^\infty \rangle\right] f(t) \qquad (3.8) $$
We obtain from equation 3.7 the equilibrium value
$$ \langle m^\infty \rangle = \frac{\langle s \rangle + \eta}{1 + 2\eta} \qquad (3.9) $$
which is larger than the initial value $\langle s \rangle = \int_0^1 ds\, \tilde P(s)\, s$ for $\langle s \rangle < 1/2$, indicating an increase in synchronized bursting activity. We saw that the equilibrium state of the system is approached in an oscillatory fashion. It is therefore interesting to ask what type of time correlations will develop in the output of our network if it is stimulated with uncorrelated noise. The autocovariance function
$$ C(\tau) = \lim_{t \to \infty}\left[\langle m^{t+\tau} m^t \rangle - \langle m^t \rangle^2\right] \qquad (3.10) $$
can be computed directly, since $m^t$ and $P^\infty(m)$ are known explicitly from equations 3.7 and 3.8. We find
$$ C(\tau) = \delta_{\tau,0}\, C_0 + (1 - \delta_{\tau,0})\, C_1\, \eta^{\tau/2} \cos(\Omega \tau + \varphi) \qquad (3.11) $$
with $\delta_{\tau,0}$ the Kronecker symbol, $\delta_{\tau,0} = 1$ for $\tau = 0$ and 0 else.² Figure 4 shows that $C(\tau)$ from equation 3.10 consists of two parts: a delta peak at $\tau = 0$ that reflects random uncorrelated bursting, and an oscillatory decaying part that indicates correlations in the output. The period of the oscillations
$$ T = \frac{2\pi}{\Omega} = \frac{2\pi}{\pi - \arctan\left[\sqrt{4\eta - \eta^2}/\eta\right]} \qquad (3.12) $$
varies monotonically between $3 \le T \le 4$ as $\theta/w$ moves from zero to one. Since $\eta$ is given by $\int_{\theta/w}^1 \tilde P(s)\, ds$, we see that the strength of these oscillations increases as the excitatory coupling $w$ increases. The emergence of periodic correlations can be understood in the limit $\theta/w \to 0$, where the period $T$ becomes three (and $\eta = \int_0^1 \tilde P(s)\, ds = 1$), because according to equation 2.6, $m^t = 0$ is followed by $m^{t+1} = s^t$, which leads for $\theta/w \to 0$ always to $m^{t+2} = 1$ followed by $m^{t+3} = 0$. In other words, the temporal dynamics of $m^t$ has the form $0\, s^1\, 1\, 0\, s^4\, 1\, 0\, s^7\, 1\, 0\, s^{10}\, 1\, 0 \ldots$. In the opposite case of $\theta/w \to 1$, $\eta$ converges to 0 and the autocovariance function $C(\tau)$ essentially contains only the peak at $\tau = 0$. Thus, the output of the network ranges from completely uncorrelated noise for $\theta/w \approx 1$ to correlated periodic bursts for $\theta/w \to 0$. Figure 4 shows the correlation function for two intermediate situations. The amplitude of the Fourier transform of the autocovariance function $C(\tau)$, that is, the power spectrum of the system, has the form
$$ P(\omega_f) = C_0 + \frac{2 C_1\, a}{a^2 + (\omega_f - \Omega)^2} \qquad (3.13) $$
with $a = -\log \eta^{1/2}$. In other words, a broad Lorentzian centered at the oscillation frequency, superimposed on a constant background corresponding to uncorrelated neural activity. It is important to discuss in this context the effect of the size $n$ of the network. If the input variables $\xi_i^t$ are distributed independently in time and space with probabilities $P_i(\xi_i^t)$, then the distribution $\tilde P(s)$ has a width³ that decreases as $1/\sqrt{n}$ as $n \to \infty$. Therefore, in a large system $\eta = \int_{\theta/w}^1 \tilde P(s)\, ds$ is either 0 if $\theta/w > \langle s \rangle$ or 1 if $\theta/w < \langle s \rangle$, where $\langle s \rangle$ is the
²The constants $C_0$, $C_1$, and $\varphi$ can be determined from $C(\tau)$ using $\mathbf{b} = [\langle A(m)m \rangle_\infty, \langle B(m)m \rangle_\infty, \langle C(m)m \rangle_\infty]$, where $\langle \cdots \rangle_\infty$ denotes the average over $P^\infty(m)$. For $\eta = 0$ this yields $C_0 = \langle s^2 \rangle - \langle s \rangle^2$, $C_1 = C_2 = 0$, and for $\eta = 1$, $\tilde P(s) = \delta(s - p)$, $p > \theta/w$, with $C(0) = C_0 = (p^2+1)/3 - (p+1)^2/9$, $C(1) = p/3 - (p+1)^2/9$, $C(2) = C(1)$, $C(3) = C(0)$, that is, an oscillation with period 3.
³The mean squared fluctuation is $\langle s^2 \rangle - \langle s \rangle^2 = \int d\xi_1^t \cdots \int d\xi_n^t \ldots$
Figure 4: Time dependence of the autocorrelation function $C(\tau) = \lim_{t \to \infty}[\langle m^{t+\tau} m^t \rangle - \langle m^t \rangle^2]$ for two different values of $\eta = \int_{\theta/w}^1 ds\, \tilde P(s)$. The top figure corresponds to $\eta = 0.8$ and a period $T = 3.09$; the bottom correlation function is for $\eta = 0.2$ with an associated $T = 3.50$. Note the different time scales.
mean value of $s$, which coincides for $n \to \infty$ with the maximum of $\tilde P(s)$. If $\eta = 0$ the correlation function is a constant according to equation 3.11, while the system will exhibit undamped oscillations with period 3 for $\eta = 1$. An earlier example where the mean activity of a large population of neurons converges either to a fixed point or to a limit cycle has been discussed by Sompolinsky (1988). Therefore, the irregularity of the burst intervals, as shown, for instance, in Figure 2, is for independent inputs a finite
size effect. Such synchronized dephasing due to finite size has already been reported by Sompolinsky et al. (1989). However, for biologically realistic correlated inputs the width of $\tilde P(s)$ can remain finite for $n \gg 1$. For example, if the inputs $\xi_1^t, \ldots, \xi_n^t$ can be grouped into $q$ correlated sets, with finite $q$, then the width of $\tilde P(s)$ scales like $1/\sqrt{q}$. Our model, which now effectively corresponds to a situation with a finite number $q$ of inputs, leads in this case to irregular bursts that mirror and amplify the correlations present in the input signals, with an oscillatory component superimposed due to the dynamic threshold.
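The damped oscillation of equations 3.11 and 3.12 can also be seen in a direct simulation. The sketch below estimates $\eta$ and the autocovariance from a long run of equation 2.6 and prints the period predicted by equation 3.12; the Bernoulli input parameters are again illustrative choices, not values from the text.

```python
import math
import random

def run(n=20, theta_over_w=0.225, p=0.15, steps=20000, seed=2):
    """Simulate equation 2.6 and record both the inputs s^t and outputs m^t."""
    rng = random.Random(seed)
    m, ms, ss = 0.0, [], []
    for _ in range(steps):
        s = sum(rng.random() < p for _ in range(n)) / n
        ss.append(s)
        m = 0.0 if m == 1.0 else (1.0 if m >= theta_over_w else s)
        ms.append(m)
    return ms, ss

ms, ss = run()
eta = sum(s >= 0.225 for s in ss) / len(ss)  # estimate of the integral of P(s) over [theta/w, 1]
omega = math.pi - math.atan(math.sqrt(4 * eta - eta**2) / eta)
print("estimated eta:", round(eta, 3), " predicted period:", round(2 * math.pi / omega, 2))

mean = sum(ms) / len(ms)
def autocov(tau):  # empirical version of equation 3.10
    return sum(a * b for a, b in zip(ms, ms[tau:])) / (len(ms) - tau) - mean**2

print("C(tau) for tau = 0..7:", [round(autocov(t), 4) for t in range(8)])
```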
4 Conclusions and Discussion
Here we suggest a mechanism for burst synchronization that is based on the fact that excitatory coupled neurons fire in synchrony whenever a sufficient number of input signals coincide. In our model, common inhibition shuts down the activity after each burst, making the whole process repeatable. But the inhibition does not entrain signals, in contrast to previous suggestions (e.g., Lytton and Sejnowski 1991). It is rather satisfying to us that our simple model shows similarities to the dynamic behavior of the much more detailed biophysical simulations of Bush and Douglas (1991). They use in their model neurons that differ in their firing rates due to differences in cellular parameters. We use neurons with random input that generate random firing in the absence of strong coupling. In both models, all-to-all excitatory coupling leads, together with common inhibition, to burst synchronization without frequency locking. In our analysis we updated all neurons in parallel. The same model has been investigated numerically for serial (asynchronous) updating, leading to qualitatively similar results. Furthermore, very similar results should be obtained with the use of continuous neurons instead of our binary ones (Hopfield 1984). The output of our network develops oscillatory correlations whose range and amplitude increase as the excitatory coupling is strengthened. However, these oscillations do not depend on the presence of any neuronal oscillators, as in our earlier models (e.g., Kammen et al. 1990; Schuster and Wagner 1990; Niebur et al. 1991). The period of the oscillations reflects essentially the delay between the inhibitory response and the excitatory stimulus and varies only weakly with the amplitude of the excitatory coupling and the threshold. The crucial role of inhibitory interneurons in controlling the 40 Hz neuronal oscillations has been emphasized by Wilson and Bower (1991) in their simulations of olfactory and visual cortex. Our model shows complete synchronization, in the sense that all neurons fire at the same time. This suggests that the occurrence of tightly synchronized firing activity across neurons (Freeman 1978; Eckhorn et al. 1988; Gray et al. 1989; Engel et al. 1990; Wilson and Bower 1991) is more
important for feature linking and binding than the locking of oscillatory frequencies. Since the specific statistics of the input noise is, via coincidence detection, mirrored in the burst statistics, we speculate that our network, acting as an amplifier for the input noise, can play an important role in any mechanism for feature linking that exploits common noise correlations of different input signals.

Acknowledgments

We thank R. Douglas for stimulating discussions and for inspiring us to think about this problem, and H. Sompolinsky for pointing out the importance of finite size effects. Our collaboration was supported by the Stiftung Volkswagenwerk. The research of C. K. is supported by the National Science Foundation, the James McDonnell Foundation, and the Air Force Office of Scientific Research.

References

Brown, T. H., Kairiss, E. W., and Keenan, C. L. 1990. Hebbian synapses: Biophysical mechanisms and algorithms. Annu. Rev. Neurosci. 13, 475-511.
Bush, P. C., and Douglas, R. J. 1991. Synchronization of bursting action potential discharge in a model network of neocortical neurons. Neural Comp. 3, 19-30.
Crick, F., and Koch, C. 1990. Towards a neurobiological theory of consciousness. Semin. Neurosci. 2, 263-275.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Engel, A. K., König, P., Gray, C. M., and Singer, W. 1990. Stimulus-dependent neuronal oscillations in cat visual cortex: Inter-columnar interaction as determined by cross-correlation analysis. Eur. J. Neurosci. 2, 588-606.
Freeman, W. J. 1978. Spatial properties of an EEG event in the olfactory bulb and cortex. Electroencephalogr. Clin. Neurophysiol. 44, 586-605.
Gray, C. M., Engel, A. K., König, P., and Singer, W. 1990. Stimulus-dependent neuronal oscillations in cat visual cortex: Receptive field properties and feature dependence. Eur. J. Neurosci. 2, 607-619.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus attributes. Nature (London) 338, 334-337.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Horn, D., and Usher, M. 1989. Neural networks with dynamical thresholds. Phys. Rev. A 40, 1036.
Horn, D., and Usher, M. 1990. Excitatory-inhibitory networks with dynamical thresholds. Int. J. Neural Syst. 1, 249-257.
Kammen, D., Holmes, P., and Koch, C. 1990. Collective oscillations in neuronal networks. In Advances in Neural Information Processing Systems, Vol. 2, D. Touretzky, ed., pp. 76-83. Morgan Kaufmann, San Mateo, CA.
Lytton, W. W., and Sejnowski, T. J. 1991. Simulations of cortical pyramidal neurons synchronized by inhibitory interneurons. J. Neurophysiol. 66, 1-22.
McCulloch, W. S., and Pitts, W. A. 1943. A logical calculus of the ideas immanent in neural nets. Bull. Math. Biophys. 5, 115-137.
Niebur, E., Kammen, D. M., Koch, C., Ruderman, D., and Schuster, H. G. 1991. Phase-coupling in two-dimensional networks of interacting oscillators. In Advances in Neural Information Processing Systems, D. S. Touretzky and R. Lippmann, eds., pp. 123-129. Morgan Kaufmann, San Mateo, CA.
Schuster, H. G., and Wagner, P. 1990. A model for neuronal oscillations in the visual cortex: I. Mean-field theory and the derivation of the phase equations. Biol. Cybern. 64, 77-82.
Sompolinsky, H. 1988. Statistical mechanics of neural networks. Phys. Today 40, 2-12.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1989. Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. U.S.A. 87, 7200-7204.
Sporns, O., Gally, J. A., Reeke, G. N., Jr., and Edelman, G. M. 1989. Reentrant signalling among simulated neuronal groups leads to coherency in their oscillatory activity. Proc. Natl. Acad. Sci. U.S.A. 86, 7265-7269.
Treves, A., and Amit, D. 1989. Low firing rates: An effective Hamiltonian for excitatory neurons. J. Phys. A 22, 2205-2226.
von der Malsburg, C., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybern. 54, 29-40.
Wilson, M. A., and Bower, J. M. 1992. Cortical oscillations and temporal interactions in a computer simulation of piriform cortex. J. Neurophysiol., in press.
Received 2 April 1991; accepted 26 August 1991.
Communicated by Richard Lippmann
On a Magnitude Preserving Iterative MAXnet Algorithm Bruce W. Suter Matthew Kabrisky Department of Electrical and Computer Engineering, Air Force Institute of Technology, Wright-Patterson AFB, OH 45433 USA
Neural Computation 4, 224-233 (1992)
@ 1992 Massachusetts Institute of Technology
A new iterative maximum picking neural net (MAXnet) is presented. This formulation determines the value and the location either for a unique maximum or for multiple maxima. This new net converges, for many commonly occurring distributions, in $O(\log M)$ iterations using only simple computing elements.

1 Introduction
There are many types of neural networks that pick a maximum value. Some are recurrent, must be initialized, and converge over time to find the node with the maximum input. These include the present algorithm and those in Lippmann et al. (1987), Winters and Rose (1989), Hopfield (1984), and Grossberg (1973). One of these (Winters and Rose 1989) converges in $\log(M)$ iterations and requires switching nodal elements. Some networks find the maximum value continuously with time-varying inputs. These include feedforward comparator nets (Lippmann et al. 1987; Martin 1970) that require many simple threshold-logic nodes, and a VLSI circuit (Lazzaro et al. 1988; Mead 1989). This circuit adjusts a global voltage threshold dynamically to be just below the log of the global maximum input value. The algorithm described in this paper converges after initialization to find the maximum input. This convergence (for many commonly occurring data distributions) is as fast as the algorithm in Winters and Rose (1989) but requires different types of nodal computation elements and no nodal switches.

2 New Algorithm
The output of the $j$th MAXnet node at time $t$, $v_j(t)$, is defined by
$$ v_j(t+1) = v_j(0)\, H\!\left[v_j(0) - \bar v(t)\right]; \qquad j = 1, \ldots, M, \quad t \ge 0 \qquad (2.1) $$
In this equation $\bar v(t)$, the mean value of the positive $v_j(t)$ values that remain at time $t$, is given by
$$ \bar v(t) = \frac{\sum_{k=1}^{M} v_k(t)}{Z(t)} \qquad (2.2) $$
where $Z(t)$ is the number of node outputs $v_k(t)$ that are positive at time $t$, and $H$ is the Heaviside function defined by
$$ H(x) = \begin{cases} 1 & x \ge 0 \\ 0 & \text{otherwise} \end{cases} $$
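A minimal software rendering of equations 2.1 and 2.2 may help fix ideas (a sketch only; the parallel circuit of Section 4 is the intended realization, and the termination test here simply follows the multiple-maxima condition discussed below):

```python
def maxnet(v0):
    """Iterate equations 2.1 and 2.2 until only the maxima stay positive.

    Returns (final outputs, iteration count). Surviving nodes keep their
    initial magnitudes, which is the magnitude-preserving property.
    """
    v = list(v0)
    iterations = 0
    while True:
        positive = [x for x in v if x > 0]
        if len(set(positive)) <= 1:      # a unique maximum or exact ties remain
            return v, iterations
        v_bar = sum(positive) / len(positive)               # equation 2.2
        v = [x0 if x0 - v_bar >= 0 else 0.0 for x0 in v0]   # equation 2.1
        iterations += 1

values, its = maxnet([0.1 / j for j in range(1, 101)])  # 1/X inputs, M = 100
winners = [j + 1 for j, x in enumerate(values) if x > 0]
print("winning node(s):", winners, " value:", max(values), " iterations:", its)
```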
Equations 2.1 and 2.2 form the basis of the hardware implementation provided in Section 4. During each iteration this MAXnet subtracts the mean value of the positive node outputs from the initial value of each node. After applying the Heaviside function, one or more node outputs are zeroed out, while at least one remains positive. This behavior is guaranteed by a theorem of Hardy et al. (1952), which states that if $b$ is a set of nonnegative numbers, then minimum($b$) < mean($b$) < maximum($b$), unless all the elements of $b$ are equal. This formulation preserves the maximum value while determining its location. The termination condition for equation 2.1 can be viewed in one of two ways: (1) a single positive output node or (2) a constant number of output nodes. The first way of describing the termination condition permits the new MAXnet to converge to a unique maximum, while the latter permits the new MAXnet to converge to either a unique maximum or multiple maxima. The purpose of the Heaviside function in equation 2.1 is to zero out the node outputs that are less than the mean value of the positive node outputs. In this way, the maximum is obtained. If the negative of the argument of the Heaviside function is utilized in equation 2.1, then the effect will be to zero out the $v_j(t)$s that are greater than the mean value of the positive $v_j(t)$s. The resulting minimum picking net (MINnet) is defined by
$$ v_j(t+1) = v_j(0)\, H\!\left[-v_j(0) + \bar v(t)\right]; \qquad j = 1, \ldots, M \qquad (2.4) $$
where $\bar v(t)$ is given by equation 2.2. If this MINnet is used with Kohonen nets (Kohonen 1989), then the input to equation 2.4, $v_j(0)$, would be
the Euclidean norm of the difference between the input and each of the elements of the topological surface.

3 Analysis of New Algorithm
The behavior of the new MAXnet is dependent on the distribution of input data. In the best case, the maximum input is much greater than all other inputs. Then, the MAXnet will converge to the maximum in a single iteration. In the worst case, for every iteration the mean value of the positive node outputs always lies between the smallest positive $v_j(t)$ and the next larger $v_j(t)$. In this worst-case scenario, only one $v_j(t)$ will be zeroed out in each iteration. As a result, in the worst case, the new MAXnet algorithm will converge to the maximum in $M$ iterations. Fortunately, for many commonly occurring input distributions, the performance of the MAXnet algorithm is $O(\log M)$. This will be shown analytically for ramped inputs $v_j(0) = j$; $j = 1, \ldots, M$, and empirically for $1/X$ inputs $v_j(0) = 0.1/j$; $j = 1, \ldots, M$, and for one-sided gaussian inputs. First, consider ramped inputs defined by $v_j(0) = j$; $j = 1, \ldots, M$. During the first iteration, the mean value of the inputs is $(M+1)/2$. Since the MAXnet will zero out all $v_j(t)$s less than the mean value, the number of terms zeroed out in the first iteration is $O(M/2)$. Similarly, during the second iteration, the mean value of the positive $v_j(t)$s is $(3M+2)/4$. Since the MAXnet will zero out all $v_j(t)$s less than the mean value, the number of terms zeroed out in the second iteration is $O(M/4)$. Thus, the $i$th iteration will zero $O(M/2^i)$ elements. This suggests that only $O(\log M)$ iterations will be required to find the maximum. Now consider $1/X$ inputs defined by $v_j(0) = 0.1/j$; $j = 1, \ldots, M$. Table 1 depicts the number of iterations required to find the maximum when the number of inputs varies from 5 to 5000. By observation, only $O(\log M)$ iterations are required. Next, consider $M$ inputs from one side of a gaussian distribution with mean zero and unity variance. For this case, finding the maximum corresponds to finding the tail of the distribution. Table 1 depicts the number of iterations required to find the maximum when the number of inputs varies from 5 to 5000. As in the last case, only $O(\log M)$ iterations are required.

4 Hardware Implementation of New MAXnet Algorithm
This section shows that it is possible to implement the new MAXnet algorithm using four basic building blocks. This implementation is constructed using a combination of threshold logic devices, $M$-input summers, 2-input multipliers, and a 2-input divider. The building blocks are illustrated in Figure 1. (It is important to note that after inputs are presented to a building block, a finite delay will be encountered before the corresponding change at the building block output is observed.)
Table 1: Number of Iterations Required to Locate Maximum.

                           Iterations required
         Distribution described by          One side of N(0,1)
  M      v_j(0) = 0.1/j; j = 1, ..., M      gaussian distribution
  5               2
  10              2
  50              3
  100             4
  500             4
  1000            5
  5000            5
Figure 1: Building blocks for hardware implementation of new MAXnet algorithm. (a) M-input summer, (b) 2-input divider, (c) 2-input multiplier, and (d) threshold logic device. When an input is presented to a building block, a finite delay is encountered before the corresponding change at the building block output is observed.
Figure 2: Implementation of the new algorithm using equations 2.1 and 2.2. Output 1 will be the initial magnitude for the node with maximum input and zero otherwise, while output 2 will be "1" if the node corresponds to a maximum input and "0" otherwise. Output 1 and output 2, together with the two summers and the divider, are used to compute the threshold (the mean value of the positive circuit outputs) for the next iteration. The temporal component of the MAXnet circuit is the delay from the generation of outputs 1 and 2 until the next threshold value is calculated.
The resulting hardware implementation of the new MAXnet algorithm is given in Figure 2. Initially, the inputs $v_j(0)$, $j = 1, \ldots, M$, are presented to the circuit. The threshold logic units, together with the multipliers, compare the inputs to a threshold and generate an output (denoted output 1) that preserves the magnitude of the inputs. The rest of the circuitry is used to compute the threshold. The other threshold logic units, together with one summer, provide a count of the number of positive outputs, while the corresponding threshold logic units and multipliers, together with the other summer, provide the sum of the positive outputs of the circuit. The outputs of the two summers go to a divider, which generates the mean value of the positive circuit outputs. This output of the divider is the threshold for the next iteration of the MAXnet circuit. (The time required to generate the threshold is the temporal component or delay of the circuit.) As soon as the output settles (converges), the maximum has been determined. Two network outputs are generated from the circuit. As stated above, the outputs of the multiplier units, denoted "output 1," preserve the initial magnitude for the node with maximum input and are zero otherwise, while the outputs of the threshold logic units, denoted "output 2," will be "1" if the node corresponds to the node with the maximum input and "0" otherwise.

5 Real-Time Performance of New MAXnet Algorithm
The new MAXnet algorithm can be applied to process time-varying input signals. The only constraint in this process is that the time to compute the maximum value, $T_{MAX}$, is less than the time between signal updates, $T_{UPDATE}$, that is,
$$ T_{MAX} < T_{UPDATE} \qquad (5.1) $$
where
$$ T_{MAX} = N_I\, \Delta t \qquad (5.2) $$
Here, $\Delta t$ is the time required for one MAXnet iteration and $N_I$ is the number of iterations required. The value of $\Delta t$ is dependent on the particular hardware/software configuration, while $N_I$ is bounded by $M$. (For the MAXnet circuit given in the previous section, $\Delta t$ corresponds to the delay of an $M$-input adder plus the delay of a divider.) Because $N_I$ depends on the distribution of the input data, $T_{MAX}$ can frequently be made much smaller. Thus, as long as equation 5.1 is satisfied, the new MAXnet can run with continuously time-varying inputs and select a maximum continuously.
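In software the constraint of equations 5.1 and 5.2 is a one-line check. In the sketch below, the hardware figures $\Delta t$ and $T_{UPDATE}$ are invented for illustration; the iteration bound uses $N_I \le M$ in the worst case and the $O(\log M)$ behavior of Section 3 for typical distributions.

```python
import math

def meets_deadline(M, delta_t, t_update, typical=True):
    """Check T_MAX = N_I * delta_t < T_UPDATE (equations 5.1 and 5.2)."""
    n_iterations = math.ceil(math.log2(M)) if typical else M   # bound on N_I
    return n_iterations * delta_t < t_update

print(meets_deadline(M=5000, delta_t=1e-7, t_update=1e-4))                 # True
print(meets_deadline(M=5000, delta_t=1e-7, t_update=1e-4, typical=False))  # False
```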
6 Comparison to Other MAXnet Algorithms
Lippmann et al. (1987) defined the node output for the $j$th MAXnet node, $v_j(t)$, for $t \ge 0$ as
$$ v_j(t+1) = f_t\!\left[v_j(t) - \epsilon \sum_{k \ne j} v_k(t)\right]; \qquad j = 1, \ldots, M \qquad (6.1) $$
where
$$ \epsilon \le \frac{1}{M-1} $$
$\epsilon$ is a constant scale factor representing the amount of lateral inhibition between neurons within the network. The neural threshold function is defined by
$$ f_t(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} $$
This paradigm creates a "winner-take-all" response mechanism, which determines the location of the maximum. For both the new MAXnet algorithm and Lippmann, Gold, and Malpass' MAXnet, the distribution of input data values is a significant factor controlling the convergence time. Consider $M$ inputs defined by $v_j(0) = 0.1/j$; $j = 1, \ldots, M$. Table 2 depicts the number of iterations required to find the maximum when $\epsilon = 1/(M-1)$ and the number of inputs varies from 5 to 5000. By observation, $O(M)$ iterations are required. Recall that the new MAXnet algorithm required only $O(\log M)$ iterations. Now, consider $M$ inputs from one side of an $N(0,1)$ gaussian distribution. For this case the maximum corresponds to finding the tail of the distribution. Table 2 depicts the number of iterations required to find the maximum when $\epsilon = 1/(M-1)$ and the number of inputs varies from 5 to 5000. As in the last case $O(M)$ iterations are required, while the new MAXnet algorithm required only $O(\log M)$ iterations. Hence, the new MAXnet algorithm has been shown empirically to converge substantially faster than Lippmann, Gold, and Malpass' MAXnet. It is interesting to note that the differential equation that corresponds to equation 2.1 is not of the same form as assumed by Grossberg (1973) (equation 6.3),
Table 2: Number of Iterations Required to Locate Maximum with Lippmann, Gold, and Malpass' MAXnet [ε = 1/(M − 1)].

                           Iterations required
         Distribution described by          One side of N(0,1)
  M      v_j(0) = 0.1/j; j = 1, ..., M      gaussian distribution
  5               3                                2
  10              4                                5
  50              24                               29
  100             48                               127
  500             243                              951
  1000            486                              1726
  5000            2435                             3169
where $Z(t)$ is the number of positive outputs from the previous iteration, $f$ is a nonlinearity, and $A$, $B$, and $C$ could be functions of $x_j$ and $t$. Let $A = 1$, $B = x_j + 1$, $C = \epsilon$, and $f = f_t$; then equation 6.3 becomes equation 6.4, which is the differential equation corresponding to equation 6.1. Therefore, for constant $\epsilon$, equation 6.4 is the differential equation that describes Lippmann, Gold, and Malpass' MAXnet. As noted above, the new MAXnet algorithm is substantially faster than Lippmann, Gold, and Malpass' algorithm. Winters and Rose (1989) recently defined a MAXnet algorithm that can be thought of as a tree structure that has been collapsed to a single one-dimensional array. As such, propagation to the next level is replaced by iteration, and the reduced number of cells at the subsequent levels of the tree is replaced by the action of making cells "passive" and permitting signals from adjacent cells to propagate through them. Winters and Rose thus require switches at the input to the cell and on the lines from the two adjacent cells. In the event of a tie, the leftmost cell wins. This complicates the cell's logic by requiring cells with slightly different thresholds. The number of iterations required to locate a maximum with Winters and Rose's MAXnet is $O(\log M)$. Table 3 depicts this performance for $M$ inputs from one side of an $N(0,1)$ gaussian distribution. As noted in Section 3, for many commonly occurring distributions of inputs, the new
Table 3: Number of Iterations Required to Locate Maximum with Winters and Rose's MAXnet for One Side of N(0,1) Gaussian Distribution.

  M      Iterations required
  5      2
  10     2
  50     4
  100    4
  500    6
  1000   7
  5000   8
MAXnet algorithm is $O(\log M)$. Hence, the performance of Winters and Rose's algorithm is comparable to that of the new MAXnet algorithm.
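The contrast between the two recurrent nets is easy to reproduce in simulation. The sketch below pits the mean-subtracting update of equations 2.1 and 2.2 against the lateral-inhibition update of equation 6.1 with $f_t(x) = \max(x, 0)$ and $\epsilon = 1/(M-1)$; exact counts depend on the termination test used, so they will track, rather than duplicate, Tables 1 and 2.

```python
def new_maxnet_iterations(v0):
    """Iteration count for the mean-based net of equations 2.1 and 2.2."""
    v, count = list(v0), 0
    while True:
        positive = [x for x in v if x > 0]
        if len(set(positive)) <= 1:
            return count
        v_bar = sum(positive) / len(positive)
        v = [x0 if x0 >= v_bar else 0.0 for x0 in v0]
        count += 1

def lippmann_iterations(v0):
    """Iteration count for equation 6.1 with f_t(x) = max(x, 0)."""
    v, count = list(v0), 0
    eps = 1.0 / (len(v) - 1)
    while sum(x > 0 for x in v) > 1:
        total = sum(v)
        v = [max(x - eps * (total - x), 0.0) for x in v]
        count += 1
    return count

for M in (10, 100, 1000):
    inputs = [0.1 / j for j in range(1, M + 1)]   # the 1/X distribution
    print(M, new_maxnet_iterations(inputs), lippmann_iterations(inputs))
```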
7 Conclusions

This paper presented a new, different way to select a maximum value with distributed parallel computation. The behavior of the new MAXnet algorithm with $M$ inputs is dependent on the distribution of input data: the net converges in $O(\log M)$ iterations using simple computing elements for many common input and amplitude distributions.

References
Grossberg, S. 1973. Contour enhancement, short term memory and constancies in reverberating neural networks. Stud. Appl. Math. LII, 213-257.
Hardy, G. H., Littlewood, J. E., and Polya, G. 1952. Inequalities. Cambridge University Press, Cambridge, England.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Kohonen, T. 1989. Self-Organization and Associative Memory. Springer-Verlag, New York.
Lazzaro, J., Ryckebusch, S., Mahowald, M. A., and Mead, C. A. 1988. Winner-take-all networks of O(n) complexity. Caltech Dept. of Computer Science Tech. Rep. TR-88-12.
Lippmann, R. P., Gold, B., and Malpass, M. L. 1987. A comparison of Hamming and Hopfield neural nets for pattern classification. M.I.T. Lincoln Laboratory Tech. Rep. TR-769.
Martin, T. 1970. Acoustic recognition of a limited vocabulary in continuous speech. Ph.D. dissertation, University of Pennsylvania.
Mead, C. 1989. Analog VLSI and Neural Systems. Addison-Wesley, Reading, MA.
Winters, J. H., and Rose, C. 1989. Minimum distance automata in parallel networks for optimum classification. Neural Networks 2, 127-132.
Received 19 December 1989; accepted 26 September 1991.
Communicated by Fernando Pineda
Learning Complex, Extended Sequences Using the Principle of History Compression Jürgen Schmidhuber Department of Computer Science, University of Colorado, Campus Box 430, Boulder, CO 80309 USA
Neural Computation 4, 234-242 (1992)
@ 1992 Massachusetts Institute of Technology
Previous neural network learning algorithms for sequence processing are computationally expensive and perform poorly when it comes to long time lags. This paper first introduces a simple principle for reducing the descriptions of event sequences without loss of information. A consequence of this principle is that only unexpected inputs can be relevant. This insight leads to the construction of neural architectures that learn to "divide and conquer" by recursively decomposing sequences. I describe two architectures. The first functions as a self-organizing multilevel hierarchy of recurrent networks. The second, involving only two recurrent networks, tries to collapse a multilevel predictor hierarchy into a single recurrent net. Experiments show that the system can require less computation per time step and many fewer training sequences than conventional training algorithms for recurrent nets.

1 Introduction
Several approaches to on-line supervised sequence learning have been proposed, including backpropagation through time (BPTT) (e.g., Williams and Peng 1990), the IID or RTRL algorithm (Robinson and Fallside 1987; Williams and Zipser 1989), and the recent fast-weight algorithm (Schmidhuber 1991b). These approaches are computationally intensive; BPTT is not local in time, and RTRL-like algorithms are not local in space (Schmidhuber 1991c). Common to all of these approaches is that they do not try to selectively focus on relevant inputs; they waste efficiency and resources by focusing on every input. With many applications, a second drawback of these methods is the following: the longer the time lag between an event and the occurrence of a corresponding error, the less information is carried by the corresponding backpropagated error signals. Mozer (1990) and Rohwer (1989) have addressed the latter problem but not the former. Ring (1991), on the other hand, addresses both problems, but in a manner much different from that presented here. How can a system learn to focus on the relevant points in time? What does it mean for a point in time to be relevant? How can the system learn
to reduce the number of inputs to be considered over time without losing information? A major contribution of this work is an adaptive method for removing redundant information from sequences. The next section shows that the system ought to focus on unexpected inputs and ignore expected ones.

2 History Compression
Consider a deterministic discrete time predictor (not necessarily a neural network) whose state at time $t$ is described by an environmental input vector $i(t)$, an internal state vector $h(t)$, and an output vector $o(t)$. The environment may be nondeterministic. At time 0, the predictor starts with $i(0)$ and an internal start state $h(0)$. At time $t \ge 0$, the predictor computes
$$ o(t) = f[i(t), h(t)] $$
At time $t > 0$, the predictor furthermore computes
$$ h(t) = g[i(t-1), h(t-1)] $$
All information about the input at a given time $t_x$ can be reconstructed from the knowledge about $t_x$, $f$, $g$, $i(0)$, $h(0)$, and the pairs $[t_s, i(t_s)]$ for which $0 < t_s \le t_x$ and $o(t_s - 1) \ne i(t_s)$. This is because if $o(t) = i(t+1)$ at a given time $t$, then the predictor is able to predict the next input from the previous ones. The new input is derivable by means of $f$ and $g$. Information about the observed input sequence can be even further compressed, beyond just the unpredicted input vectors $i(t_s)$. It suffices to know only those elements of the vectors $i(t_s)$ that were not correctly predicted. This observation implies that we can discriminate one sequence from another by knowing just the unpredicted inputs and the corresponding time steps at which they occurred. No information is lost if we ignore the expected inputs. We do not even have to know $f$ and $g$. I call this the principle of history compression. From a theoretical point of view it is important to know at what time an unexpected input occurs; otherwise there will be a potential for ambiguities: two different input sequences may lead to the same shorter sequence of unpredicted inputs. With many practical tasks, however, there is no need for knowing the critical time steps, as I show later.
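To make the principle concrete, here is a minimal sketch. The toy predictor that always expects the previous symbol to repeat is an illustrative stand-in for the trained mappings $f$ and $g$ of the text; only the prediction failures, paired with their time steps, are stored.

```python
def compress(sequence, predict):
    """Keep only the unpredicted inputs, together with their time steps.

    By the principle of history compression, these pairs (plus the predictor
    itself, i(0), and h(0)) suffice to reconstruct the whole sequence.
    """
    kept = [(0, sequence[0])]                  # i(0) is always stored
    for t in range(1, len(sequence)):
        if predict(sequence[:t]) != sequence[t]:
            kept.append((t, sequence[t]))      # store only prediction failures
    return kept

repeat_predictor = lambda history: history[-1]  # toy stand-in for f and g

seq = list("aaaabbbaaaaaacc")
print(compress(seq, repeat_predictor))
# -> [(0, 'a'), (4, 'b'), (7, 'a'), (13, 'c')]: only the change points survive
```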
3 A Self-organizing Multilevel Predictor Hierarchy
Using the principle of history compression we can build a self-organizing hierarchical neural "chunking" system. The system detects causal dependencies in the temporal input stream and learns to attend to unexpected inputs instead of focusing on every input. It learns to reflect both the relatively local and the relatively global temporal regularities contained in the input stream. The basic task can be formulated as a prediction task. At a given time step the goal is to predict the next input from previous inputs. If there are external target vectors at certain time steps, then they are simply treated as another part of the input to be predicted. The architecture is a hierarchy of predictors; the input to each level of the hierarchy comes from the previous level. $P_i$ denotes the $i$th level network, which is trained to predict its own next input from its previous inputs.¹ We take $P_i$ to be a conventional dynamic recurrent neural network (Robinson and Fallside 1987; Williams and Zipser 1989; Williams and Peng 1990); however, it might be some other adaptive sequence processing device as well.² At each time step the input of the lowest level recurrent predictor $P_0$ is the current external input. We create a new higher level adaptive predictor $P_{s+1}$ whenever the adaptive predictor at the previous level, $P_s$, stops improving its predictions. When this happens the weight-changing mechanism of $P_s$ is switched off (to exclude potential instabilities caused by ongoing modifications of the lower level predictors). If at a given time step $P_s$ ($s \ge 0$) fails to predict its next input (or if we are at the beginning of a training sequence, which usually is not predictable either), then $P_{s+1}$ will receive as input the concatenation of this next input of $P_s$ plus a unique representation of the corresponding time step;³ the activations of $P_{s+1}$'s hidden and output units will be updated. Otherwise $P_{s+1}$ will not perform an activation update. This procedure ensures that $P_{s+1}$ is fed with an unambiguous reduced description⁴ of the input sequence observed by $P_s$. This is theoretically justified by the principle of history compression. In general, $P_{s+1}$ will receive fewer inputs over time than $P_s$. With existing learning algorithms, the higher level predictor should have less
With existing learning algorithms, the higher level predictor should have less difficulty in learning to predict the critical inputs than the lower level predictor. This is because P_{s+1}'s "credit assignment paths" will often be short compared to those of P_s. This will happen if the incoming inputs carry global temporal structure that has not yet been discovered by P_s. This method is a simplification and an improvement of the recent chunking method described by Schmidhuber (1991a).

Often a multilevel predictor hierarchy will be the fastest way of learning to deal with sequences with multilevel temporal structure (e.g., speech). Experiments have shown that multilevel predictors can quickly learn tasks that are practically unlearnable by conventional recurrent networks (e.g., Hochreiter 1991). One disadvantage of a predictor hierarchy, however, is that it is not known in advance how many levels will be needed. Another disadvantage is that the levels are explicitly separated from each other. It can be possible, however, to collapse the hierarchy into a single network, as described next.

4 Collapsing the Hierarchy into a Single Recurrent Net

4.1 Outline. I now describe an architecture consisting of two conventional recurrent networks: the automatizer A and the chunker C. At each time step A receives the current external input. A's error function is threefold: one term forces it to emit certain desired target outputs at certain times. If there is a target, then it becomes part of the next input. The second term forces A at every time step to predict its own next nontarget input. The third (crucial) term will be explained below.

If and only if A makes an error concerning the first and second terms of its error function, the unpredicted input (including a potentially available teaching vector), along with a unique representation of the current time step, will become the new input to C. Before this new input can be processed, C (whose last input may have occurred many time steps earlier) is trained to predict this higher level input from its current internal state and its last input (employing a conventional recurrent net algorithm). After this, C performs an activation update that contributes to a higher level internal representation of the input history. Note that according to the principle of history compression C is fed with an unambiguous reduced description of the input history. The information deducible by means of A's predictions can be considered redundant. (The beginning of an episode usually is not predictable, therefore it has to be fed to the chunking level, too.)

Since C's "credit assignment paths" will often be short compared to those of A, C will often be able to develop useful internal representations of previous unexpected input events. Due to the final term of its error function, A will be forced to reproduce these internal representations, by predicting C's state. Therefore A will be able to create useful internal representations by itself in an early stage of processing a given sequence; it will often receive meaningful error signals long before errors of the first
or second kind occur. These internal representations in turn must carry the discriminating information for enabling A to improve its low-level predictions. Therefore the chunker will receive fewer and fewer inputs, since more and more inputs become predictable by the automatizer. This is the collapsing operation. Ideally, the chunker will become obsolete after some time.

It must be emphasized that, unlike the incremental creation of a multilevel predictor hierarchy described in Section 3, there is no formal proof that the 2-net on-line version is free of instabilities. For instance, one can imagine situations where A unlearns previously learned predictions because of the third term of its error function. Relative weighting of the different terms in A's error function represents an ad hoc remedy for this potential problem. In the experiments (presented in Section 5) relative weighting was not necessary.

4.2 Details of the 2-Net Chunking Architecture. The system described below is the on-line version of a representative of a number of variations of the basic principle described in Section 4.1. See Schmidhuber (1991c) for various modifications. Table 1 gives an overview of various time-dependent activation vectors relevant for the description of the algorithm. Additional notation: "∘" is the concatenation operator; δ_d(t) = 1 if the teacher provides a target vector d(t) at time t, and δ_d(t) = 0 otherwise. If δ_d(t) = 0 then d(t) takes on some default value, for example, the zero vector.

A has n_I + n_D input units, n_HA hidden units, and n_OA output units (see Table 1). With pure prediction tasks n_D = 0. C has n_HC hidden units and n_OC output units. All of A's input and hidden units have directed connections to all of A's hidden and output units. All input units of A have directed connections to all hidden and output units of C. This is because A's input units serve as input units for C at certain time steps. There are n_time additional input units for C for providing unique representations of the current time step. These additional input units also have directed connections to all hidden and output units of C. All hidden units of C have directed connections to all hidden and output units of C.

A will try to make d_A(t) equal to d(t) if δ_d(t) = 1, and it will try to make p_A(t) equal to x(t), thus trying to predict x(t). Here again the target prediction problem is defined as a special case of an input prediction problem. C will try to make d_C(t) equal to the externally provided teaching vector d(t) if δ_d(t) = 1 and if A failed to emit d(t). Furthermore, it will always try to make p_C(t) ∘ s_C(t) equal to the next nonteaching input to be processed by C. This input may be many time steps ahead. Finally, and most importantly, A will try to make q_A(t) equal to h_C(t) ∘ o_C(t), thus trying to predict the state of C. The activations of C's output units are considered part of its state. Both C and A are trained simultaneously by a conventional algorithm for recurrent networks in an on-line fashion. Both the RTRL algorithm
Table 1: Definitions of Symbols Representing Time-Dependent Activation Vectors (all referring to time t).

  x(t)                                "normal" environmental input                 dimension n_I
  d(t)                                teacher-defined target                       dimension n_D
  i_A(t) = x(t) ∘ d(t)                A's input                                    dimension n_I + n_D
  h_A(t)                              A's hidden activations                       dimension n_HA
  d_A(t)                              A's prediction of d(t)                       dimension n_D
  p_A(t)                              A's prediction of x(t)                       dimension n_I
  time(t)                             unique representation of t                   dimension n_time
  h_C(t)                              C's hidden activations                       dimension n_HC
  d_C(t)                              C's prediction of C's next target input      dimension n_D
  p_C(t)                              C's prediction of C's next "normal" input    dimension n_I
  s_C(t)                              C's prediction of C's next "time" input      dimension n_time
  o_C(t) = d_C(t) ∘ p_C(t) ∘ s_C(t)   C's output                                   dimension n_OC = n_D + n_I + n_time
  q_A(t)                              A's prediction of h_C(t) ∘ o_C(t)            dimension n_HC + n_OC
  o_A(t) = d_A(t) ∘ p_A(t) ∘ q_A(t)   A's output                                   dimension n_OA = n_D + n_I + n_HC + n_OC

Note: "∘" is the concatenation operator. h_A(t) and o_A(t) are based on previous inputs and are computed without knowledge of d(t) and x(t).
and BPTT are appropriate. In particular, computationally inexpensive variants of BPTT (Williams and Peng 1990) are interesting: there are tasks with hierarchical temporal structure where only a few iterations of "backpropagation back into time" per time step are in principle sufficient to bridge arbitrary time lags (see Section 5).

I now describe the (quite familiar) procedure for updating activations in a net. Repeat for a constant number of iterations (typically one or two):

1. For each noninput unit j of the net compute u_j = f_j(Σ_i a_i w_ij), where a_i is the current activation of unit i, f_j is a semilinear differentiable function, and w_ij is the weight on the directed connection from unit i to unit j.

2. For all noninput units j: set a_j equal to u_j.

I now specify the input-output behavior of the chunker and the automatizer, as well as the details of error injection. Initialization: all weights are initialized randomly. In the beginning, at time step 0, make h_C(0) and h_A(0) equal to zero, and make i_A(0) equal to d(0) ∘ x(0). Represent time step 0 in time(0). Update C to obtain h_C(1) and o_C(1).
For all times t > 0 until interruption do:

1. Update A to obtain h_A(t) and o_A(t). A's error e_A(t) is defined as

2e_A(t) = [p_A(t) ∘ q_A(t) − x(t) ∘ h_C(t) ∘ o_C(t)]^T [p_A(t) ∘ q_A(t) − x(t) ∘ h_C(t) ∘ o_C(t)] + δ_d(t) [d_A(t) − d(t)]^T [d_A(t) − d(t)].

Use a gradient descent algorithm for dynamic recurrent nets to change each weight w_ij of A in proportion to (the approximation of) −∂e_A(t)/∂w_ij. Set i_A(t) to d(t) ∘ x(t). Uniquely represent t in time(t).

2. If A's low-level error

2e^P(t) = [p_A(t) − x(t)]^T [p_A(t) − x(t)] + δ_d(t) [d_A(t) − d(t)]^T [d_A(t) − d(t)]

is less than or equal to a small constant ε ≥ 0, then set h_C(t + 1) = h_C(t), o_C(t + 1) = o_C(t).

Else define C's prediction error e_C(t) by

2e_C(t) = [p_C(t) − x(t)]^T [p_C(t) − x(t)] + δ_d(t) [d_C(t) − d(t)]^T [d_C(t) − d(t)] + [s_C(t) − time(t)]^T [s_C(t) − time(t)],

use a gradient descent algorithm for dynamic recurrent nets to change each weight w_ij of C in proportion to (the approximation of) −∂e_C(t)/∂w_ij, and update C to obtain h_C(t + 1) and o_C(t + 1).
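The control flow of this loop, stripped of the network internals, can be summarized in the following sketch. The interface (predict, observe, train, state) is a hypothetical stand-in for any trainable recurrent predictor, and dist2 is ordinary squared error; none of these names come from the paper.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def run_chunker(A, C, stream, eps=0.0):
    """On-line control flow of the automatizer A and chunker C.
    `stream` yields (x, d, time_code): the normal input, the target vector
    (or None when the teacher is silent), and a unique code for the step."""
    for x, d, time_code in stream:
        p_x, p_d, p_state = A.predict()   # o_A(t), computed before x(t), d(t) arrive
        # Train A on all three error terms: predict x(t), predict d(t),
        # and imitate the chunker's state h_C(t) o o_C(t).
        A.train(target=(x, d, C.state()))
        # Low-level error: first and second terms only.
        low = dist2(p_x, x) + (dist2(p_d, d) if d is not None else 0.0)
        if low > eps:
            # A failed to predict: the unpredicted input plus the unique
            # time code becomes C's next input. C is first trained to
            # predict it, then performs its activation update.
            C.train(target=(x, d, time_code))
            C.observe((x, d, time_code))
        # else: h_C(t+1) = h_C(t), o_C(t+1) = o_C(t) -- C stays unchanged.
        A.observe((x, d))                 # i_A(t) = d(t) o x(t)
```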
5 An Experiment
Josef Hochreiter (a student at TUM) tested a chunking system against a conventional recurrent net algorithm. See Hochreiter (1991) and Schmidhuber (1991c) for details. A prediction task with a 20-step time lag was constructed. There were 22 possible input symbols a, x, b1, b2, ..., b20. The learning systems observed one input symbol at a time. There were only two possible input sequences: a b1 ... b20 and x b1 ... b20. These were presented to the learning systems in random order. At a given time step, one goal was to predict the next input (note that in general it was not possible to predict the first symbol of each sequence, due to the random occurrence of x and a). The second (and more difficult) goal was to make the activation of a particular output unit (the "target unit") equal to 1 whenever the last 21 processed input symbols were a, b1, ..., b20, and to make this activation 0 whenever the last 21 processed input symbols were x, b1, ..., b20. No episode boundaries were used: input sequences were fed to the learning systems without providing information about their beginnings and their ends. Therefore there was a continuous stream of input events.
With the conventional algorithm, with various learning rates, and with more than 1,000,000 training sequences, it was not possible to obtain a significant performance improvement concerning the target unit. A similar task involving time lags of as few as 5 steps required many hundreds of thousands of training sequences. But a chunking system was able to solve the 20-step task rather quickly, using an efficient approximation of the BPTT method in which error was propagated a maximum of 3 steps into the past (although there was a 20-step time lag!). No unique representations of time steps were necessary for this task. Out of 17 test runs, 13 required fewer than 5000 training sequences; the remaining test runs required fewer than 35,000 training sequences. Typically, A quickly learned to predict the "easy" symbols b2, ..., b20. This led to a greatly reduced input sequence for C, which then did not have much trouble learning to predict the target values at the end of the sequences. After a while A was able to mimic C's internal representations, which in turn allowed it to learn correct target predictions by itself. A's final weight matrix often looked like one that one would hope to get from the conventional algorithm: there were hidden units that learned to bridge the 20-step time lags by means of strong self-connections. The chunking system needed less computation per time step than the conventional method, yet it required many fewer training sequences.
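For concreteness, a minimal generator for this benchmark stream might look as follows. The symbol names follow the text; everything else (the function name, the None convention for silent steps) is an illustrative assumption.

```python
import random

def episodes():
    """Endless stream of (symbol, target) pairs for the 20-step lag task.
    The target unit should output 1 at the end of an 'a' episode and 0 at
    the end of an 'x' episode; None means no target at that step."""
    while True:
        head = random.choice(['a', 'x'])
        for symbol in [head] + [f'b{i}' for i in range(1, 21)]:
            target = None
            if symbol == 'b20':                  # 20 steps after the head
                target = 1.0 if head == 'a' else 0.0
            yield symbol, target
```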
6 Concluding Remarks

It seems that people tend to memorize and focus on atypical or unexpected events, and that they often try to explain new atypical events in terms of previous atypical events. In light of the principle of history compression this makes a lot of sense. Once events become expected, they tend to become "subconscious." There is an obvious analogy to the chunking algorithm: the chunker's attention is removed from events that become expected; they become "subconscious" (automatized) and give rise to even higher level "abstractions" of the chunker's "consciousness."⁵

The chunking systems described in Schmidhuber (1991a,c) and the current paper try to detect temporal regularities and learn to use them for identifying relevant points in time. A general criticism of more conventional algorithms can be formulated as follows: these algorithms do not try to selectively focus on relevant inputs; they waste efficiency and resources by focusing on every input. Speech is a good example of a domain involving multilevel temporal structure. Ongoing research will explore the application of chunking systems to speech recognition.

⁵This distinction between attended and automatized events can also be found in the systems of Myers (1990) and of Ring (1991).
The principle of history compression is not limited to neural networks. Any adaptive sequence processing device could make use of it.
Acknowledgments

Thanks to Josef Hochreiter for conducting the experiments. Thanks to Mike Mozer for useful comments on an earlier draft of this paper. This research was supported by NSF PYI award IRI-9058450, grant 9021 from the James S. McDonnell Foundation, and DEC external research grant 1250 to Michael C. Mozer.
References

Hochreiter, J. 1991. Diploma thesis, Institut für Informatik, Technische Universität München.
Miyata, Y. 1988. An unsupervised PDP learning model for action planning. In Proceedings of the Tenth Annual Conference of the Cognitive Science Society, pp. 223-229. Erlbaum, Hillsdale, NJ.
Mozer, M. C. 1990. Connectionist music composition based on melodic, stylistic, and psychophysical constraints. Tech. Rep. CU-CS-495-90, University of Colorado at Boulder.
Myers, C. 1990. Learning with delayed reinforcement through attention-driven buffering. Tech. Rep., Imperial College of Science, Technology and Medicine.
Ring, M. 1991. Incremental development of complex behaviors through automatic construction of sensory-motor hierarchies. In Machine Learning: Proceedings of the Eighth International Workshop (ML91), L. Birnbaum and G. Collins, eds., pp. 343-347. Morgan Kaufmann.
Robinson, A. J., and Fallside, F. 1987. The utility driven dynamic error propagation network. Tech. Rep. CUED/F-INFENG/TR.1, Cambridge University Engineering Department.
Rohwer, R. 1989. The 'moving targets' training method. In Proceedings of 'Distributed Adaptive Neural Information Processing', St. Augustin, 24.-25.5, J. Kindermann and A. Linden, eds. Oldenbourg.
Schmidhuber, J. H. 1991a. Adaptive decomposition of time. In Artificial Neural Networks, T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, eds., pp. 909-914. Elsevier Science Publishers B.V., Amsterdam.
Schmidhuber, J. H. 1991b. Learning to control fast-weight memories: An alternative to recurrent nets. Tech. Rep. FKI-147-91, Institut für Informatik, Technische Universität München.
Schmidhuber, J. H. 1991c. Neural sequence chunkers. Tech. Rep. FKI-148-91, Institut für Informatik, Technische Universität München.
Williams, R. J., and Peng, J. 1990. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Comp. 4, 491-501.
Williams, R. J., and Zipser, D. 1989. Experimental analysis of the real-time recurrent learning algorithm. Connect. Sci. 1(1), 87-111.

Received 16 May 1991; accepted 20 September 1991.
Communicated by Fernando Pineda
A Fixed Size Storage O(n³) Time Complexity Learning Algorithm for Fully Recurrent Continually Running Networks

Jürgen Schmidhuber
Department of Computer Science, University of Colorado, Campus Box 430, Boulder, CO 80309 USA
The real-time recurrent learning (RTRL) algorithm (Robinson and Fallside 1987; Williams and Zipser 1989) requires O(n⁴) computations per time step, where n is the number of noninput units. I describe a method suited for on-line learning that computes exactly the same gradient and requires fixed-size storage of the same order but has an average time complexity per time step of O(n³).

1 Introduction
There are two basic methods for performing steepest descent in fully recurrent networks with n noninput units and m = O(n) input units. Backpropagation through time (BPTT) [e.g., Williams and Peng (1990)] requires potentially unlimited storage in proportion to the length of the longest training sequence but needs only O(n²) computations per time step. BPTT is the method of choice if training sequences are known to have fewer than O(n) time steps. For training sequences involving many more time steps than n, for training sequences of unknown length, and for on-line learning in general, one would like an algorithm with upper bounds on the storage and on the computations required per time step. Such an algorithm is the RTRL algorithm (Robinson and Fallside 1987; Williams and Zipser 1989). It requires only fixed-size storage of the order O(n³) but is computationally expensive: it requires O(n⁴) operations per time step.¹

The algorithm described herein² requires O(n³) storage, too. Every O(n) time steps it requires O(n⁴) operations, but on all other time steps it requires only O(n²) operations. This cuts the average time complexity per time step to O(n³).

¹Pineda has described another recurrent net algorithm that, as he states, "has some of the worst features of both algorithms" (Pineda 1990). His algorithm requires O(n⁴) memory and O(n⁴) computations per time step, if the number of time steps exceeds n.
²Since the acceptance of this paper for publication it has come to my attention that the same algorithm was derived by Ron Williams (Williams 1989; Williams and Zipser 1992).
2 The Algorithm
The notation will be similar to that of Williams and Peng (1990). U is the set of indices k such that at the discrete time step t the quantity x_k(t) is the output of a noninput unit k in the network. I is the set of indices k such that x_k(t) is an external input for input unit k at time t. T(t) denotes the set of indices k ∈ U for which there exists a specified target value d_k(t) at time t. Each input unit has a directed connection to each noninput unit. Each noninput unit has a directed connection to each noninput unit. The weight of the connection from unit j to unit i is denoted by w_ij. To distinguish between different "instances" of w_ij at different times, we let w_ij(t) denote a variable for the weight of the connection from unit j to unit i at time t. This is just for notational convenience: w_ij(t) = w_ij for all t to be considered. One way to visualize the w_ij(t) is to consider them as weights of connections to the tth noninput layer of a feedforward network constructed by "unfolding" the recurrent network in time [e.g., Williams and Peng (1990)]. A training sequence with s + 1 time steps starts at time step 0 and ends at time step s. The algorithm below is of interest if s >> n (otherwise it is preferable to use BPTT). For k ∈ U we define

net_k(0) = 0,   ∀t ≥ 0: x_k(t) = f_k[net_k(t)],   ∀t ≥ 0: net_k(t + 1) = Σ_{l ∈ U ∪ I} w_kl(t + 1) x_l(t)   (2.1)

where f_k is a differentiable (usually semilinear) function. For all w_ij and for all l ∈ U, t ≥ 0 we define

q^l_ij(t) = ∂net_l(t)/∂w_ij = Σ_{τ=1}^{t} ∂net_l(t)/∂w_ij(τ).

Furthermore we define

e_k(t) = d_k(t) − x_k(t) if k ∈ T(t), and 0 otherwise,

E(t) = (1/2) Σ_{k ∈ U} [e_k(t)]²,

E^total(t', t) = Σ_{τ=t'+1}^{t} E(τ).
The algorithm is a cross between the BPTT and the RTRL algorithm. The description of the algorithm will be interleaved with its derivation and some comments concerning complexity. The basic idea is: decompose the calculation of the gradient into blocks, each covering O(n) time steps. For each block perform n + 1 BPTT-like passes: one pass for calculating error derivatives, and n passes for calculating derivatives of the net inputs to the n noninput units at the end of each block. Perform n + 1 RTRL-like calculations for integrating the results of these BPTT-like passes into the results obtained from previous blocks.
The algorithm starts by setting the variable t_0 ← 0; t_0 represents the beginning of the current block. Note that for all possible l, w_ij: q^l_ij(0) = 0 and ∂E^total(0, 0)/∂w_ij = 0. The main loop of the algorithm consists of five steps.

Step 1: Set h ← O(n) (I recommend: h ← n). The quantity ∂E^total(0, t_0)/∂w_ij for all w_ij is already known, and q^l_ij(t_0) is known for all appropriate l, i, j. There is an efficient way of computing the contribution of E^total(0, t_0 + h) to the change in w_ij, Δw_ij(t_0 + h):

Δw_ij(t_0 + h) = −α ∂E^total(0, t_0 + h)/∂w_ij = −α Σ_{τ=1}^{t_0+h} ∂E^total(0, t_0 + h)/∂w_ij(τ),

where α is a learning rate constant.

Step 2: Let the network run from time step t_0 to time step t_0 + h according to the activation dynamics specified in equation 2.1. If it turns out that the current training sequence has fewer than t_0 + h time steps (i.e., h > s − t_0), then h ← s − t_0. If h = 0 then EXIT.

Step 3: Perform a combination of a BPTT-like phase with an RTRL-like calculation for computing error derivatives, as described next. We write

∂E^total(0, t_0 + h)/∂w_ij = ∂E^total(0, t_0)/∂w_ij + ∂E^total(t_0, t_0 + h)/∂w_ij
  = ∂E^total(0, t_0)/∂w_ij + Σ_{τ=1}^{t_0} ∂E^total(t_0, t_0 + h)/∂w_ij(τ) + Σ_{τ=t_0+1}^{t_0+h} ∂E^total(t_0, t_0 + h)/∂w_ij(τ).   (2.2)

The first term of equation 2.2 is already known. Consider the third term:

Σ_{τ=t_0+1}^{t_0+h} ∂E^total(t_0, t_0 + h)/∂w_ij(τ) = −Σ_{τ=t_0+1}^{t_0+h} δ_i(τ) x_j(τ − 1),

where

δ_i(τ) = −∂E^total(t_0, t_0 + h)/∂net_i(τ).

For a given t_0, δ_i(τ) can be computed for all i ∈ U, t_0 ≤ τ ≤ t_0 + h with a single h-step BPTT pass of the order of O(hn²) operations:

δ_i(τ) = f'_i[net_i(τ)] e_i(τ)  if τ = t_0 + h,
δ_i(τ) = f'_i[net_i(τ)] [e_i(τ) + Σ_{k ∈ U} w_ki δ_k(τ + 1)]  if t_0 ≤ τ < t_0 + h.
What remains is the computation of the second term of equation 2.2 for all w_ij, which requires O(n³) operations:

Σ_{τ=1}^{t_0} ∂E^total(t_0, t_0 + h)/∂w_ij(τ) = Σ_{k ∈ U} [∂E^total(t_0, t_0 + h)/∂net_k(t_0)] Σ_{τ=1}^{t_0} ∂net_k(t_0)/∂w_ij(τ) = −Σ_{k ∈ U} δ_k(t_0) q^k_ij(t_0).
Step 4: To compute q^l_ij(t_0 + h) for all possible l, i, j, perform n combinations of a BPTT-like phase with an RTRL-like calculation (one such combination for each l), as follows:

q^l_ij(t_0 + h) = ∂net_l(t_0 + h)/∂w_ij = Σ_{τ=1}^{t_0+h} ∂net_l(t_0 + h)/∂w_ij(τ)
  = Σ_{τ=1}^{t_0} ∂net_l(t_0 + h)/∂w_ij(τ) + Σ_{τ=t_0+1}^{t_0+h} ∂net_l(t_0 + h)/∂w_ij(τ)
  = Σ_{k ∈ U} γ_lk(t_0) q^k_ij(t_0) + Σ_{τ=t_0+1}^{t_0+h} γ_li(τ) x_j(τ − 1),   (2.3)

where

γ_lk(τ) = ∂net_l(t_0 + h)/∂net_k(τ).

For a given t_0, a given l ∈ U, and for all i ∈ U, t_0 ≤ τ ≤ t_0 + h, the quantity γ_li(τ) can be computed with a single h-step BPTT-like pass of the order of O(hn²) operations:

if τ = t_0 + h: γ_li(τ) = 1 if l = i, and γ_li(τ) = 0 otherwise;
if t_0 ≤ τ < t_0 + h: γ_li(τ) = f'_i[net_i(τ)] Σ_{k ∈ U} w_ki γ_lk(τ + 1).

For a given l, the computation of equation 2.3 for all w_ij requires O(n³ + hn²) operations. Therefore Step 3 and Step 4 together require (n + 1) O(hn² + n³) operations spread over h time steps. Since h = O(n), O(n⁴) computations are spread over O(n) time steps. This implies an average of O(n³) computations per time step. The final step of the algorithm's main loop is
Step 5: Set t_0 ← t_0 + h and go to Step 1.

The off-line version of the algorithm waits until the end of an episode (which need not be known in advance) before performing weight changes. An on-line version performs weight changes each time Step 4 is completed. As formulated above, the algorithm needs O(n⁴) computations at its peak, every nth time step. Nothing prevents us, however, from distributing these O(n⁴) computations more evenly over n time steps. One way of achieving this is to perform one of the n BPTT-like phases of Step 4 at each time step of the next "block" of n time steps.
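To make the bookkeeping concrete, here is a minimal NumPy sketch of one main-loop iteration (Steps 2-4) under the definitions above. It is an illustration, not the paper's code; the tanh nonlinearity, the array layout, and all function names are assumptions.

```python
import numpy as np

def f(net):
    return np.tanh(net)

def fprime(net):
    return 1.0 - np.tanh(net) ** 2

def process_block(W, net0, ext, tgt, tmask, q, grad):
    """One block of h time steps (Steps 2-4), vectorized.

    W     : (n, n+m) weights w_ij; columns j < n are recurrent, j >= n input.
    net0  : (n,) net inputs net_k(t0) at the start of the block.
    ext   : (h, m) external inputs for time steps t0, ..., t0+h-1.
    tgt, tmask : (h+1, n) targets d_k(tau) and indicators of k in T(tau),
            for tau = t0, ..., t0+h.
    q     : (n, n, n+m), q[l,i,j] = d net_l(t0) / d w_ij  -- O(n^3) storage.
    grad  : (n, n+m), accumulated d E_total(0, t0) / d w_ij.
    Returns (net_h, q_new, grad_new), all referring to time t0 + h.
    """
    n = W.shape[0]
    h = ext.shape[0]

    # Forward pass (equation 2.1), O(h n^2).
    nets = np.empty((h + 1, n)); nets[0] = net0
    z = np.empty((h, W.shape[1]))            # z(tau) = x_U(tau) o x_I(tau)
    for s in range(h):
        z[s] = np.concatenate([f(nets[s]), ext[s]])
        nets[s + 1] = W @ z[s]

    # Single h-step BPTT pass for delta_i(tau) (Step 3), O(h n^2).
    e = tmask * (tgt - f(nets))              # e_k(tau)
    e[0] = 0.0                               # E(t0) belongs to the previous block
    delta = np.empty((h + 1, n))
    delta[h] = fprime(nets[h]) * e[h]
    for s in range(h - 1, -1, -1):
        delta[s] = fprime(nets[s]) * (e[s] + W[:, :n].T @ delta[s + 1])

    # Third term of equation 2.2: -sum_{tau > t0} delta_i(tau) x_j(tau-1).
    grad = grad - np.einsum('si,sj->ij', delta[1:], z)
    # Second term of equation 2.2: -sum_k delta_k(t0) q[k,i,j], O(n^3).
    grad = grad - np.einsum('k,kij->ij', delta[0], q)

    # n BPTT-like passes for gamma_lk(tau) = d net_l(t0+h) / d net_k(tau)
    # (Step 4), all l at once: O(h n^3) in total.
    gamma = np.empty((h + 1, n, n)); gamma[h] = np.eye(n)
    for s in range(h - 1, -1, -1):
        gamma[s] = (gamma[s + 1] @ W[:, :n]) * fprime(nets[s])[None, :]

    # Equation 2.3: fold the old q into the new one (RTRL-like step).
    q = (np.einsum('lk,kij->lij', gamma[0], q)
         + np.einsum('sli,sj->lij', gamma[1:], z))
    return nets[h], q, grad
```

Run over an episode with q and grad initialized to zero and h ≈ n, the per-step cost averages O(n³) while storage stays at the O(n³) of the q array, matching the analysis above; the weight update is then Δw = −α grad.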
3 Concluding Remarks

Like the RTRL algorithm, the method needs a fixed amount of storage of the order O(n³). Like the RTRL algorithm [but unlike the methods described in Williams and Peng (1990) and Zipser (1989)], the algorithm computes the exact gradient. Since it is O(n) times faster than RTRL, it should be preferred. Following the argumentation in Williams and Peng (1990), continuous-time versions of BPTT and RTRL (Pearlmutter 1989; Gherrity 1989) can serve as a basis for a correspondingly efficient continuous-time version of the algorithm presented here (by means of Euler discretization).

Many typical environments produce input sequences that have both local and more global temporal structure. For instance, input sequences are often hierarchically organized (e.g., speech). In such cases, sequence-composing algorithms (Schmidhuber 1991a,b) can provide superior alternatives to pure gradient-based algorithms.

Acknowledgments

Thanks to Mike Mozer, Bernd Schürmann, and Daniel Prelinger for providing useful comments on an earlier draft of this paper. This research was supported by NSF PYI award IRI-9058450, grant 9021 from the James S. McDonnell Foundation, and DEC external research grant 1250 to Michael C. Mozer.

References

Gherrity, M. 1989. A learning algorithm for analog fully recurrent neural networks. IEEE/INNS Int. Joint Conf. Neural Networks, San Diego 1, 643-644.
Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1, 263-269.
Pineda, F. J. 1990. Time dependent adaptive neural networks. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 710-718. Morgan Kaufmann, San Mateo, CA.
Robinson, A. J., and Fallside, F. 1987. The utility driven dynamic error propagation network. Tech. Rep. CUED/F-INFENG/TR.1, Cambridge University Engineering Department.
Schmidhuber, J. H. 1991a. Adaptive decomposition of time. In Artificial Neural Networks, T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, eds., pp. 909-914. Elsevier Science Publishers B.V., North-Holland.
Schmidhuber, J. H. 1991b. Learning complex, extended sequences using the principle of history compression. Neural Comp. 4, 234-242.
Williams, R. J. 1989. Complexity of exact gradient computation algorithms for recurrent neural networks. Tech. Rep. NU-CCS-89-27, Boston: Northeastern University, College of Computer Science.
Williams, R. J., and Peng, J. 1990. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Comp. 4, 491-501.
Williams, R. J., and Zipser, D. 1989. Experimental analysis of the real-time recurrent learning algorithm. Connection Sci. 1(1), 87-111.
Williams, R. J., and Zipser, D. 1992. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Backpropagation: Theory, Architectures and Applications, Y. Chauvin and D. E. Rumelhart, eds. Erlbaum, Hillsdale, NJ.
Zipser, D. 1989. A subgrouping strategy that reduces learning complexity and speeds up learning in recurrent networks. Neural Comp. 1, 552-558.
Received 16 May 1991; accepted 20 September 1991.
Communicated by Eric Baum
How Tight Are the Vapnik-Chervonenkis Bounds?

David Cohn
Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195 USA

Gerald Tesauro
IBM Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA
We describe a series of numerical experiments that measure the average generalization capability of neural networks trained on a variety of simple functions. These experiments are designed to test the relationship between average generalization performance and the worst-case bounds obtained from formal learning theory using the Vapnik-Chervonenkis (VC) dimension (Blumer et al. 1989; Haussler et al. 1990). Recent statistical learning theories (Tishby et al. 1989; Schwartz et al. 1990) suggest that surpassing these bounds might be possible if the spectrum of possible generalizations has a "gap" near perfect performance. We indeed find that, in some cases, the average generalization is significantly better than the VC bound: the approach to perfect performance is exponential in the number of examples m, rather than the 1/m result of the bound. However, in these cases, we have not found evidence of the gap predicted by the above statistical theories. In other cases, we do find the 1/m behavior of the VC bound, and in these cases, the numerical prefactor is closely related to the prefactor contained in the bound.

1 Introduction
The study of generalization is the study of how well a learning system can perform on inputs that it has not seen during training. Significant theoretical progress in the understanding of generalization has been made in the last few years utilizing a concept known as the Vapnik-Chervonenkis dimension, or VC-dimension (Blumer et al. 1989). Without going into a lot of detail, one can say that the VC-dimension is a measure of a learning system's representational capacity or complexity. As a learning system's VC-dimension increases, its capacity to represent a wide variety of functions increases, and it is intuitively clear that it needs more data to be trained accurately. For feedforward neural networks, the VC-dimension can be calculated exactly in the single-layer case, and can be bounded above and below in the multilayer case (Baum and Haussler
1989). The lower bound on the dimension is just the number of weights in the network, and this is often used as a rough estimate of the actual VC-dimension. Using the notion of VC-dimension, learning theorists have been able to prove a number of important theorems that provide worst-case bounds on the ability of arbitrary learning systems to generalize (Blumer et al. 1989; Haussler et al. 1990, 1991). These theorems are generally framed as follows: suppose an arbitrary learning system of VC-dimension d is trained on a set of m examples of an arbitrary function that are chosen at random from an arbitrary but fixed probability distribution, and suppose that the learning system is able to correctly classify the examples. Then we can place an upper bound on the worst-case generalization error of the learner on future examples drawn from the same distribution: with high confidence, the worst-case generalization error will be less than or equal to ε, for

ε ≤ O[(d/m) ln(m/d)].   (Vapnik 1982)
A complementary worst-case lower bound due to Ehrenfeucht et al. (1988) demonstrates that there exist classes for which ε ≥ O(d/m), and thus bounds the worst case to within a logarithmic factor. If the error on the training examples is nonzero, similar statements can be made that bound how much worse the generalization error can be compared to the error on the training examples (Vapnik 1982; Blumer et al. 1989).

The generality of the above-mentioned theorems is impressive: they apply regardless of what type of learning system is employed, what type of input-output function is being learned, or what kind of probability distribution is used to generate the examples. Yet it is precisely the broad generality of these worst-case bounds that leads one to question whether they are relevant to a given learning system's average or expected generalization performance. It may well be the case that many classes of naturally occurring real-world problems are much easier to learn, and would have much lower expected errors, than suggested by the above worst-case bounds. On the other hand, it may also be possible that the bounds are tight and are characteristic of typical errors, even for reasonable, "nonmalicious" functions and distributions of examples.

A variation on this approach describes worst-case upper bounds for a learner following a particular strategy. Haussler et al. (1990) describe a prediction strategy for binary classification problems that has a worst-case upper bound on generalization error of ε ≤ d/m, and this bound is shown in Haussler et al. (1991) to apply to any Bayes-optimal prediction strategy. We will thus refer to this bound as the "Bayes-optimal bound." It more closely reflects what is thought by many to be the "typical" generalization behavior of common learning problems. Empirical support for this hypothesis is reported in Baum (1990), in which the measured generalization error of multilayer perceptrons behaved roughly as W/m,
where W is the number of weights in the network. However, Baum's target functions were randomly chosen multilayer perceptrons, and it may be the case that these target functions are much harder to learn than many classes of naturally occurring real-world problems.

A recent, novel theoretical approach enables the calculation of expected performance of learning systems, rather than worst-case performance, under certain conditions (Tishby et al. 1989; Schwartz et al. 1990). These theories, which we call "statistical learning theories," calculate expected performance essentially by computing an ensemble average over all possible results of training on a set of size m. The formalism of Tishby et al. (1989) explicitly makes contact with the formalism of equilibrium statistical mechanics from physics in order to compute these averages. The result of both formalisms is a recursive formula for the expected generalization after m examples in terms of the generalization at m − 1 examples. Iteration of this formula back to m = 0 yields a formula for expected generalization in terms of p₀(g), the spectrum of possible generalizations before training on any examples. This formula gives the following robust prediction: if the spectrum of possible generalizations is continuous at g = 1 (i.e., perfect performance), then the expected generalization error will fall off as 1/m, in agreement with the VC bound. On the other hand, if the spectrum of possible generalizations has a "gap" near perfect performance, then the error will approach zero exponentially in the number of examples: ε ~ exp(−m/m₀), where m₀ is a constant related to the size of the gap. Numerical evidence for such an exponential approach to perfect performance in fact predates these theories, and was reported in Ahmad (1988) and Ahmad and Tesauro (1988), in which single-layered networks were trained on the majority function, a linearly separable Boolean predicate.

The above indications of an exponential convergence to perfect performance provide encouragement that it might be possible to surpass the worst-case VC bound by a substantial margin in at least some cases. Unfortunately, since the statistical formalisms treat only the equilibrium average over the final results of learning, and do not consider possible dynamic effects of the learning process, it is not known whether they are directly applicable to a learning procedure such as backpropagation.

We have attempted to address the issue of whether average performance can surpass worst-case performance by carrying out a series of detailed numerical experiments that measure the average generalization of simple neural networks trained on a variety of simple functions. Our experiments are a follow-up and extension of the work originally reported in Ahmad (1988) and Ahmad and Tesauro (1988). They test both the relevance of the worst-case VC bounds to average generalization performance, and the predictions of the statistical theories of exponential behavior due to a gap in the generalization spectrum.

In the following section, we describe the input-output functions that were used in our experiments, and discuss our methodology for
gathering the data (measurements of average generalization as a function of number of examples), controlling for various sources of experimental error, and analyzing the data. In Section 3, we present results for single-layer nets trained on linearly separable functions, while in Section 4, we present results for multilayer nets trained on higher order functions. Section 5 then compares these experimental results with the worst-case VC bounds for these learning systems. In Section 6, we describe a separate experiment that actually measures the generalization spectrum near perfect performance for one of the single-layer networks, in an attempt to find the theoretically predicted "gap." Finally, the concluding section discusses the implications of our results, and possible directions for future work.
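Before turning to the experiments, it may help to see the three candidate learning-curve forms of the introduction side by side. The following sketch simply evaluates them for an illustrative VC-dimension; the constant factors, the function names, and the numbers are assumptions for illustration, not values from the paper.

```python
import math

def vc_bound(m, d):
    """Worst-case VC form eps = O((d/m) ln(m/d)); constant factor 1 assumed."""
    return (d / m) * math.log(m / d)

def bayes_optimal_bound(m, d):
    """Bayes-optimal worst-case bound of Haussler et al.: eps <= d/m."""
    return d / m

def exponential_curve(m, m0, A=1.0):
    """Gap prediction of the statistical theories: eps = A exp(-m/m0)."""
    return A * math.exp(-m / m0)

d, m0 = 51, 50           # e.g., a 50-input single-layer net (VC dim N + 1)
for m in (200, 500, 1000, 2000):
    print(m, vc_bound(m, d), bayes_optimal_bound(m, d),
          exponential_curve(m, m0))
```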
2 Network Simulation Methodology

Two pairs of N-dimensional classification tasks were examined in our experiments: two linearly separable functions ("majority" and "real-valued threshold") and two higher order functions ("majority-XOR" and "threshold-XOR"). Majority is a Boolean predicate in which the output is 1 if and only if more than half of the inputs are 1. The real-valued threshold function is a natural extension of majority to the continuous space [0, 1]^N: the output is 1 if and only if the sum of the N real-valued inputs is greater than N/2. The majority-XOR function is a Boolean function where the output is 1 if and only if the Nth input disagrees with the majority computed by the first N − 1 inputs. This is a natural extension of majority that retains many of its symmetry properties; for example, the positive and negative examples are equally numerous and somewhat uniformly distributed. Similarly, threshold-XOR is a natural extension of the real-valued threshold function that maps [0, 1]^{N−1} × {0, 1} to {0, 1}. Here, the output is 1 if and only if the Nth input, which is binary, disagrees with the threshold function computed by the first N − 1 real-valued inputs.

Networks trained on these tasks used sigmoidal units and had standard feedforward, fully connected structures with at most a single hidden layer. The training algorithm was standard backpropagation with momentum (Rumelhart et al. 1986). A simulator run consisted of training a randomly initialized network on a training set of m examples of the target function, chosen uniformly from the input space. Networks were trained until all examples were classified within a specified training threshold of the correct classification. Once the classification error on all examples was less than the threshold, the network was judged to have converged, and training was stopped. Runs that failed to converge within a cutoff time of 50,000 epochs were discarded. The generalization error of the resulting network was then estimated by testing on a set of 8000 novel examples independently drawn from the same distribution. A test example was classified "correctly" if
the classification error was less than some testing threshold (in these cases the same as the training threshold). The fraction of the test examples not classified correctly was used as an estimate of the generalization error of the network. The average generalization error for a given value of m was typically computed by averaging the results of 40 simulator runs, each with a different set of training patterns, test patterns, and random initial weights. By plotting the average generalization error against m, one may derive an empirical ”generalization curve” for the function and network in question. We restricted our attention to values of m ranging from 40 to 2000 training examples. For training set sizes any smaller than this, the asymptotic behavior is likely to be irrelevant; for training set sizes much larger than 2000, there are two difficulties: first, the networks take an intolerably long time to converge (see Section 3), and second, as the generalization error approaches zero, the accuracy of error measurements degrades to a point where it is nearly useless. 2.1 Sources of Error. Our experiments were carefully controlled for a number of potential sources of error. Random errors due to the particular choice of random training patterns, test patterns, and initial weights were reduced to low levels by performing a large number of runs and varying each of these in each run. We have also looked for systematic errors due to the particular values of learning rate and momentum constants, initial random weight scale, training batch size, training threshold, and training cutoff time. Within wide ranges of parameter values, we find no significant dependence of the generalization performance on the particular choice of any of these parameters except k, the training batch size.’ (However, the parameter values do affect the rate of convergence and probability of convergence on the training set.) Variations in k appear to alter the numerical coefficients of the learning curve, but not the overall functional form. Another potential concern is the possibility of overtraining: even though the training set error should decrease monotonically with training time, the test set error might reach a minimum and then increase with further training. We have monitored hundreds of simulations of both the linearly separable and higher order tasks, and find no significant overtraining in either case. Other aspects of the experimental protocol that could affect measured results include order of pattern presentation, size of test set, testing threshold, and choice of input representation. We find that presenting the patterns in a random order as opposed to a fixed order improves the probability of convergence, but does not alter the average generalization of runs that do converge. Changing the criterion by which a test ‘Training batch size is the number of patterns during which the 6 values are accumulated before connection weights are changed during training. It should not be confused with training set size.
254
David Cohn and Gerald Tesauro
pattern is judged correct alters the numerical prefactor of the learning curve but not the functional form. For small-to-moderate values of M, we confirmed that our test set sizes wefe sufficiently large by doubling the number of test patterns and finding no significant change in the measured generalization values. Finally, convergence is faster with a [-l,11 coding scheme than with a [0,I] scheme, and generalization is improved, but only by numerical constants. 2.2 Analysis of Data. To determine the functional dependence of measured generalization error t on the training set size m, we apply the standard curve-fitting technique of performing linear regression on the appropriately transformed data. Thus we can look for an exponential law t = Aecm/moby plotting log(€)vs. m and observing whether the transformed data lie on a straight line. We also look for a polynomial law of the form E = B / ( m + a ) by plotting l / t vs. m. We have not attempted to fit to a more general polynomial law because this is less reliable, and because theory predicts a I / m law. By plotting each experimental curve in both forms, log(€)vs. m and 1 1 6 vs. m, we can determine which model provides a better fit to the data. This can be done both visually and more quantitatively by comparing the values of r2, the linear correlation coefficient, in the two plots.
3 Experiments on Linearly Separable Functions
Networks with 50 inputs and no hidden units were trained on majority and real-valued threshold functions, with training set sizes ranging from m = 40 to m = 2000 in increments of 40 patterns (below rn = 540, we used an increment of 20 patterns). Forty networks were trained for each value of m, for a total of 2440 runs on each function. A total of 1.2% of the binary majority and 10.5%of the real-valued threshold simulation runs failed to converge and were discarded. The data obtained from the binary majority and real-valued threshold problems were tested for fit to the exponential and polynomial functional models, as shown in Figure 1. In the exponential plot, a change in the shape of the curve is visible as the size of the training set approaches 220 examples. Since we are interested in asymptotic behavior, we restricted our analysis to training set sizes 240 and above. The binary majority data fit the exponential model with a correlation coefficient of r2 = 0.9927, compared with a correlation coefficient of 1.2 = 0.8266 when fit to the polynomial model. Based on this evidence and the graphs in Figure 1, we conclude that the binary majority data are consistent with an exponential law and not with a l/m law. The real-valued threshold data, however, behaved in the opposite manner. The exponential fit gave a value of r;! = 0.9366, while the
0
10
20
40
500
I
Training set size
I
'. 2000
1
1500
1000
L -
I
I
0 '
20
40
60
80
100
120
Ma jority-XOR
'
I
500
,,,'o
/
'
/
r o _'
I
I
I
,6
I
0 , '
I
/ ,
I
1500
,,'
\I
Training set size
1000
I
0
/
/
/ /
I
I
I
/
'\
Std error of mean - .- - - - Best fit
0
_ _ _
/
?'
/
'
/
0
/
T
,l'
0
2000
I
/
0
Figure 1: Observed generalization curve for binary majority and its fit to the exponential (left) and polynomial (right) models. The dotted line indicates the best fit to the model and dashed lines indicate standard error of the mean for each training set size.
Table 1: The Transition between Polynomial- and Exponential-Shaped Learning Curves for the Multilayer, Binary Majority Problem Appears to Occur Near a Fixed Error Rate.

                  Number of       Transition point
  Input size      hidden units    Error (ε)    Training set size (m)    VCD/m
  25              1               -            -                        -
  25              2               0.12         210                      0.24
  25              3               0.11         320                      0.24
  25              4               0.12         280                      0.36
  25              10              0.11         510                      0.49
  50              1               0.41         55                       0.93
  50              2               0.14         300                      0.34
  50              3               0.13         380                      0.50
  50              4               0.14         420                      0.48
polynomial fit gave a value of r² = 0.997, indicating a very good fit to the 1/m polynomial model and only a fair fit to the exponential. This may be confirmed visually in Figure 2 by noting the distinct curvature in the fit to the exponential model and the lack of any curvature in the fit to the polynomial model.

Of interest, too, is the total number of pattern presentations needed to train the networks to convergence (training set size × number of epochs). While the number of presentations needed for the binary majority problem peaked briefly at around 200 examples and then leveled out, the number of presentations needed for the real-valued threshold function increased monotonically with training set size (Fig. 3).

3.1 Training Multilayer Networks on Linearly Separable Functions. To extend our results on the linearly separable problems, we ran a number of experiments training networks with a single layer of hidden units on the 25- and 50-input binary majority and real-valued threshold functions. The generalization curves of multilayer networks trained on the real-valued threshold functions appeared to fit the polynomial model, with the generalization error increasing as more hidden units were added. The generalization curves for networks trained on the binary majority function were surprising: they exhibited a polynomial shape up to a point, after which the curve shifted to a distinctly exponential shape (Fig. 4). This point appeared at an approximately constant error rate, where ε = 0.12 for the 25-input case and ε = 0.14 for the 50-input case (see Table 1).
Figure 2: Observed generalization curve for the real-valued threshold function and its fit to the exponential and polynomial models. The dotted line indicates the best fit to the model and dashed lines indicate standard error of the mean for each training set size.
Figure 3: The number of pattern presentations needed to train the networks on the majority function was roughly independent of training set size, but for the threshold function it increased monotonically with training set size.
With this shift in mind, one may look back to the experiments on the binary majority function without hidden units and observe that this is approximately where the shift in form appears in its generalization curve. Examining the reasons for this shift and its location is beyond the intent of this paper, but the phenomenon will need to be addressed and explained in the future if generalization is to be fully understood.

4 Experiments on Higher Order Functions
For the majority-XOR and threshold-XOR problems, we used N = 26 input units: 25 for the "majority" (or threshold) part and a single "XOR" unit. In theory, these problems can be solved with only two hidden units, but in practice at least three hidden units were needed for reliable convergence. We found that adding more than the minimal number of hidden units decreased generalization performance but left the functional form of the generalization curves unchanged. Training set sizes ranged from m = 40 to m = 2000, in increments of 20 for m < 1000 and in increments of 100 for m > 1000. For each training set size where m < 1000, 40 simulations were run; for each m > 1000 we ran 80 simulations. A total of 1.6% of the majority-XOR runs and 7.8% of the threshold-XOR runs failed to converge on their training sets and were discarded.
Figure 4: The generalization curve of a network with two hidden units trained on the binary majority function. A transition from a polynomial learning curve to an exponential one is visible near the point where training set size m = 300 and error ε = 0.14.

With both sets of runs, there was a visible change in the shape of the generalization curve when the training set size reached 200 samples. We are interested primarily in the asymptotic behavior of these curves, so we restricted our analysis to training set sizes 200 and above. As with the single-layer problems, we measured goodness of fit to appropriately linearized forms of the exponential and polynomial curves in question. Results are plotted in Figures 5 and 6. It appears that the generalization curve of the threshold-XOR problem is not likely to have been generated by an exponential, but is a plausible 1/m polynomial. The correlation coefficient of the fit to the exponential model is only r² = 0.8991, but in the fit to the polynomial model, r² = 0.9986. The binary majority-XOR data are not as straightforward. Although it is significantly faster than a 1/m polynomial, the tail of the curve (above
Figure 5: Generalization curve for 26-3-1 nets trained on majority-XOR, and its fit to the exponential and polynomial models. The dotted line indicates the best fit to the model and dashed lines indicate standard error of the mean for each training set size.
m = 1400) is slower than a pure exponential. Despite this, the match to the exponential model is fair, with r² = 0.9755, compared to the match with the polynomial model, where r² = 0.9345.

It is interesting to note that, as with networks trained on the single-layer problems, the number of epochs required to converge was relatively independent of training set size for the binary problem, but increased regularly with training set size for the real-valued problem. However, this dependence does not appear until the training set size exceeds roughly 200, the same place where the shape of the generalization curve changes.

5 Comparison to Theory
5 Comparison to Theory

We examined how the observed generalization curves compare to the theoretical upper bounds described in Blumer et al. (1989) and Haussler et al. (1990). This comparison is illustrated in Figure 7. For the N-input, single-layer networks, the VC-dimension of the network is N + 1. In the higher order case we have used the total number of weights as an estimate of the VC-dimension; this is a lower bound on the dimension, following Baum and Haussler (1989). The general worst-case upper bound described in Blumer et al. (1989) is stated in terms of the training set size m: an arbitrary learner that classifies all m of its randomly drawn training examples correctly will have generalization error of less than ε, with confidence 1 − δ, if
\[
m \;\geq\; \max\!\left[\frac{4}{\epsilon}\log_2\frac{2}{\delta},\;\frac{8d}{\epsilon}\log_2\frac{13}{\epsilon}\right]
\]

where d is the VC-dimension.
Since this bound is a distribution-free worst-case upper bound for any algorithm that classifies the training data correctly, it is not surprising that it lies considerably above any of the observed curves, and, in the case of the higher order functions, is completely off the scale. The bound in Haussler et al. (1990, 1991) is a distribution-free worst-case upper bound for a Bayes-optimal learning algorithm. If such an algorithm classifies its m randomly drawn training examples correctly, then it will, with high confidence, have a generalization error of at most ε ≤ d/m.² It has been shown that the backpropagation algorithm is Bayes-optimal to the extent that the algorithm is able to find a global error minimum and to the extent that the network is able to represent the optimal Bayes function (Ruck et al. 1990; Wan 1990). In our experiments, networks that converged appeared to find globally near-minimal errors, but it is unknown to what extent the networks were able to represent the Bayes-optimal solutions. In the limit of large training set sizes, the Bayes-optimal solution will approach the true classification boundary (perfect generalization).

²In Figure 7, this bound is plotted for ε as a function of the minimum m for which the bound holds.
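As a worked numerical illustration of how these bounds compare (our own sketch, not the authors' code; the 85-weight count for the 26-3-1 network is our estimate of d):

    import math

    def blumer_sample_bound(eps, delta, d):
        # Smallest m satisfying the distribution-free worst-case bound
        # m >= max((4/eps) log2(2/delta), (8d/eps) log2(13/eps)).
        return max((4 / eps) * math.log2(2 / delta),
                   (8 * d / eps) * math.log2(13 / eps))

    def haussler_error_bound(m, d):
        # Bayes-optimal learner bound: eps <= d/m.
        return d / m

    # 26-3-1 network: 26*3 + 3 weights plus 4 biases = 85 parameters,
    # used as an estimate of the VC-dimension d (our assumption).
    d = 85
    print(blumer_sample_bound(eps=0.1, delta=0.05, d=d))  # ~4.8e4 examples
    print(haussler_error_bound(m=2000, d=d))              # ~0.043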
[Figure 7 plots: generalization error versus training set size for the single-layer problems (left) and the XOR problems (right); left legend: BEHW bound, Bayes-optimal bound, threshold, majority; right legend: Bayes-optimal bound, threshold-XOR, majority-XOR.]

Figure 7: (left) The real-valued threshold problem performs roughly within a constant factor of the upper bounds predicted in Blumer et al. (1989) and Haussler et al. (1990), while the binary majority problem performs asymptotically better. (right) The threshold-XOR performs roughly within a constant factor of the predicted bound, while majority-XOR performs asymptotically better. Note that for the XOR functions, the Blumer et al. bound is so high as to be off the graph.
Since the networks in our experiments are able to represent the true boundary, this bound should apply in the limit. At smaller training set sizes, however, the bound should give us guidance, but does not strictly apply. One may note in Figure 7 that the observed curves lie below, but within a small numerical constant of, this bound. The worst-case lower bound in Ehrenfeucht et al. (1988) states that for any learning algorithm, there exists a learning problem such that, even if the learner classifies all m training examples correctly, it cannot guarantee an error of less than ε with high confidence unless ε ≥ (d − 1)/(32m). Its significance is that it tells us that a function and sample distribution exist that will give an error of at least ε for arbitrary m. It is also of interest to compare our results with recent theoretical work of Sompolinsky et al. (1990), which uses a statistical mechanics formalism to calculate the expected generalization error of a single-layer perceptron. Sompolinsky et al. find that, when the weights of the perceptron are binary, or at least when the prior distribution of weights is sharply peaked around +1 and −1, the generalization error has a discontinuous transition to perfect generalization at a finite number of training examples. However, when the weights are continuous and the priors are not strongly peaked, a 1/m curve is recovered, at least in the high-temperature limit. This theory assumes that the input space is continuous, so it may apply to our results for the real-valued threshold function. It would be of interest to extend the theory to discrete input spaces, to see if it could account for our exponential learning curves for the binary majority function.

6 Examination of "The Gap"
The theories described in Tishby et al. (1989) and Schwartz et al. (1990) predict that if there is a gap in the spectrum of possible generalizations near the level of perfect generalization, then the shape of the learning curve will be exponentially decreasing. In this section we consider the nature of this gap in more detail, since it could provide an explanation for the exponential learning curves seen in our binary experiments. At first glance one might think that, since the input space is discrete in our binary experiments, the generalization is necessarily quantized in levels of 1/2^N, and that this discretization provides the gap that explains our results. However, this cannot be the case: to explain our experiments, the gap must be much larger than 1/2^N. An exponential curve due to a gap of this size would have a characteristic scale of size m₀ ∼ 2^N, and the scales seen in our experiments are much smaller than this. (For N = 26, for instance, 2^N ≈ 6.7 × 10⁷, while the observed characteristic scales are on the order of hundreds of examples.) Gaps that are much larger than 1/2^N might come about in a number of ways. One possibility is that the network simply cannot represent near-perfect solutions. We call such a gap a "representation" gap. Another possibility is that, even though the network can represent near-perfect solutions, the total volume in weight space of such solutions is negligible.
We call this a "weight space" gap. Finally, it may be the case that, although there are many near-perfect solutions, the learning algorithm for some reason tends to avoid them. We call this an "inductive" gap, as it indicates an inductive bias in the learning rule.

6.1 There Is No Representation Gap. By analyzing the structure of the single-layer network, it can be shown that even in the binary case, no representation gap exists. For the majority function with N = 2k + 1 inputs we can build a configuration that classifies all but one of the 2^{2k+1} possible patterns correctly as follows: set the bias of the output unit at −(k + 1), so that it will output a "1" only if the sum of its inputs is greater than or equal to this value. For the first k + 1 input units, set the weights connecting them to the output equal to 1. For the last k units, set the weights to (k + 1)/k. Examination will reveal that this network will classify all inputs of the majority problem correctly except for the input
x_error = 00···0 11···1, consisting of k + 1 zeros followed by k ones.
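This construction can be verified exhaustively for a small instance. The following Python sketch (our own illustration, using exact rational arithmetic to avoid rounding at the decision boundary) builds the N = 11 network just described and confirms that it misclassifies exactly the single input x_error:

    from fractions import Fraction
    from itertools import product

    k = 5                                  # N = 2k + 1 = 11 inputs
    w = [Fraction(1)] * (k + 1) + [Fraction(k + 1, k)] * k
    theta = k + 1                          # fire iff weighted sum >= k + 1

    def net(x):
        return int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)

    def majority(x):
        return int(sum(x) >= k + 1)

    errors = [x for x in product((0, 1), repeat=2 * k + 1)
              if net(x) != majority(x)]
    print(len(errors), errors)   # 1 [(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)]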
Similar networks can be constructed that will correctly classify all but two, three, or more of the possible inputs. This analysis extends to the majority-XOR function. If one of the hidden units computes the above off-by-one majority function of the first N − 1 inputs, the rest of the network, computing an XOR, will misclassify exactly the two input patterns that begin with x_error.

6.2 There Is No Weight Space Gap. Since the gap is not inherent in the representation, we next examined the volume of near-perfect solutions in weight space. This was done by making small random perturbations about a known perfect solution to the 11-input majority problem.³ The initial solution of the network consisted of a weight of 1 connecting each of the inputs to the output, and a bias of −5.5 at the output. We added uniformly distributed random noise to each of the weights in the network, and exhaustively tested the generalization of the resulting network. This was repeated over a range of eight noise levels from 0.1 to 1.25, with 150,000 networks for each noise level. For all the perturbation levels, the number of "near-perfect" solutions [ε = O(1/2^N)] was nonnegligible, approaching the number of perfect solutions as the noise increased. We then compared the generalization spectrum of 150,000 perturbations using 0.65 noise with that of 750,000 perturbations using 0.75 noise. The 11-dimensional volume accessible with a perturbation of 0.75 is approximately 5 times the volume accessible with a perturbation of 0.65, so

³An 11-input network was the largest network that computing time permitted testing exhaustively. The 11-input problem has 2048 possible inputs, while the 25-bit problem discussed earlier has over 33 million.
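The weight-space experiment is straightforward to reproduce in outline. The sketch below is our own simplified version (far fewer runs, and the use of symmetric uniform noise on the weights only is an assumption):

    import random
    from itertools import product

    N = 11
    inputs = list(product((0, 1), repeat=N))
    targets = [int(sum(x) > N // 2) for x in inputs]

    def generalization(w, b):
        # Fraction of all 2^11 input patterns classified correctly.
        correct = sum(
            int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == t
            for x, t in zip(inputs, targets))
        return correct / len(inputs)

    def spectrum(noise, runs=1000):
        # Perturb the perfect solution (all weights 1, bias -5.5).
        results = []
        for _ in range(runs):
            w = [1.0 + random.uniform(-noise, noise) for _ in range(N)]
            results.append(generalization(w, -5.5))
        return results

    spec = spectrum(0.65)
    print(sum(g == 1.0 for g in spec), "perfect out of", len(spec))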
[Figure 8 plot: histogram of solution counts (0-1600) versus generalization level (0.975-1.000); legend: 0.65 perturbation, 0.75 perturbation.]
Figure 8: The empirically derived generalization spectrum indicates that, statistically, the number of "near-perfect" solutions is comparable to the number of perfect solutions for the binary majority function.
the 5 times as many runs should give us a density near the original solution that is comparable with the 150,000 runs at 0.65. The fact that the number of near-perfect solutions remains quite close while the number of less correct solutions increases sharply (see Fig. 8) indicates that most, if not all, of the near-correct solutions lie close in weight space to correct solutions. This experiment gives us confidence that the original set of perturbation experiments provides an accurate picture of near-perfect generalization performance: near-perfect solutions are at least as common in configuration space as completely correct solutions, and there is no quantitative gap in solution space.

6.3 What Is Responsible for Exponential Generalization? Although we have observed exponential generalization on binary problems, we have been unable to identify the causes of this behavior in terms of the statistical theories that predict it.
These theories do not take into account any dynamic effects of the learning process, so it is possible that there is an "inductive gap": the backpropagation algorithm may, for some reason, avoid the observed near-perfect solutions. The parameters of the backpropagation algorithm affect the numerical coefficients of the generalization curve to a small extent, so it is conceivable that the choice of the algorithm itself has an unpredicted effect on generalization. Another possibility is that the exponential behavior is inherent in the nature of the binary functions themselves, and is independent of the network architecture and learning algorithm. If this is the case, then we can expect that with advances in statistical learning theory we will see theories that account for this behavior without relying on a gap in representation or solution space.

7 Conclusions
We have seen that two problems using strict binary inputs (majority and majority-XOR) exhibited distinctly exponential generalization with increasing training set size. This indicates that there exists a class of problems that is asymptotically much easier to learn than others of the same VC-dimension. This is not only of theoretical interest, but it also has potential bearing on what kinds of large-scale applications might be tractable with network learning methods. On the other hand, merely by making the inputs real instead of binary, we found average error curves lying close to the theoretical bounds. This indicates that the worst-case bounds may be more relevant to expected performance than has been previously realized. The statistical theories of Tishby et al. (1989) and Schwartz et al. (1990) predict the two classes of behavior seen in our experiments, but there appears to be a discrepancy with theory in that we have been unable to find evidence of a "gap" in the generalization spectrum that would explain the observed exponential behavior. We have not yet ruled out the possibility of an inductive gap, and will attempt to look for such a gap in future experiments. The apparent discrepancy might also be resolved through further investigation of the consequences of the statistical learning theories. There could be other mechanisms by which these theories could explain the exponential behavior. (For example, a "dip" in the generalization spectrum, as opposed to a gap, might also result in exponential behavior.) But it might also be the case that the statistical theories in their current form simply do not apply to learning rules such as backpropagation, and that alternative formalisms need to be developed. Whatever mechanism accounts for the exponential generalization could also have important practical implications for those seeking to improve neural network generalization. If the difference in performance is due to an inductive gap (or any other constraint) imposed by the learn-
ing algorithm, then it would be fruitful for future research to focus on improved learning algorithms that exploit this behavior, and on learning theories that take the learning process into account. If instead the difference in generalization performance is inherent in the nature of the problems studied and can be explained by statistical methods, then it may be better to focus attention on problem specification and representation.
Acknowledgments

We would like to thank an anonymous referee for helpful comments on earlier versions of this manuscript. A portion of the work described here was done while D. Cohn was at IBM Research Center.
References

Ahmad, S. 1988. A study of scaling and generalization in neural networks. Tech. Rep. UILU-ENG-88-1759, University of Illinois at Urbana-Champaign.

Ahmad, S., and Tesauro, G. 1988. Scaling and generalization in neural networks: A case study. In Proceedings of the 1988 Connectionist Models Summer School, D. S. Touretzky et al., eds., pp. 3-10. Morgan Kaufmann, San Mateo, CA.

Baum, E. B. 1990. When are k-nearest neighbor and back propagation accurate for feasible sized sets of examples? In Neural Networks EURASIP Workshop, L. B. Almeida and C. J. Wellekens, eds., pp. 2-25. Springer-Verlag, Berlin.

Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1(1), 151-160.

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comp. Mach. 36(4), 929-965.

Ehrenfeucht, A., Haussler, D., Kearns, M., and Valiant, L. 1988. A general lower bound on the number of examples needed for learning. In Proceedings of the 1988 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.

Haussler, D., Littlestone, N., and Warmuth, M. 1990. Predicting {0,1}-functions on randomly drawn points. Tech. Rep. UCSC-CRL-90-54, University of California at Santa Cruz.

Haussler, D., Kearns, M., and Schapire, R. 1991. Unifying bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. In Proceedings of the 4th Annual Workshop on Computational Learning Theory, pp. 61-74. Morgan Kaufmann, San Mateo, CA.

Ruck, D. W., Rogers, S. K., Kabrisky, M., Oxley, M. E., and Suter, B. W. 1990. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transact. Neural Networks 1(4), 296-298.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.
Schwartz, D. B., Samalam, V. K., Solla, S. A., and Denker, J. S. 1990. Exhaustive learning. Neural Comp. 2, 374-385.

Sompolinsky, H., Tishby, N., and Seung, H. S. 1990. Learning from examples in large neural networks. Phys. Rev. Lett. 65, 1683-1686.

Tishby, N., Levin, E., and Solla, S. A. 1989. Consistent inference of probabilities in layered networks: Predictions and generalizations. IJCNN Proc. 2, 403-409.

Vapnik, V. N. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York.

Wan, E. 1990. Neural network classification: A Bayesian interpretation. IEEE Transact. Neural Networks 1(4), 303-305.
Received 18 March 1991; accepted 1 July 1991.
Communicated by Fernando Pineda
Working Memory Networks for Learning Temporal Order with Application to Three-Dimensional Visual Object Recognition

Gary Bradski
Gail A. Carpenter
Stephen Grossberg
Center for Adaptive Systems and Department of Cognitive and Neural Systems, Boston University, Boston, MA 02215 USA
Working memory neural networks, called Sustained Temporal Order REcurrent (STORE) models, encode the invariant temporal order of sequential events in short-term memory (STM). Inputs to the networks may be presented with widely differing growth rates, amplitudes, durations, and interstimulus intervals without altering the stored STM representation. The STORE temporal order code is designed to enable groupings of the stored events to be stably learned and remembered in real time, even as new events perturb the system. Such invariance and stability properties are needed in neural architectures which self-organize learned codes for variable-rate speech perception, sensorimotor planning, or three-dimensional (3-D) visual object recognition. Using such a working memory, a self-organizing architecture for invariant 3-D visual object recognition is described. The new model is based on the model of Seibert and Waxman (1990a), which builds a 3-D representation of an object from a temporally ordered sequence of its two-dimensional (2-D) aspect graphs. The new model, called an ARTSTORE model, consists of the following cascade of processing modules: Invariant Preprocessor → ART 2 → STORE Model → ART 2 → Outstar Network.
1 Introduction

Working memory is the type of memory whereby a telephone number, or other novel temporally ordered sequence of events, can be temporarily stored and then performed (Baddeley 1986). Working memory, a kind of short-term memory (STM), can be quickly erased by a distracting event, unlike long-term memory (LTM). There is a large experimental literature about working memory, as well as a variety of models (Atkinson and Shiffrin 1971; Cohen and Grossberg 1987; Cohen et al. 1987; Elman 1990; Grossberg 1970, 1978a,b; Grossberg and Pepe 1971; Grossberg and Stone
1986; Gutfreund and Mezard 1988; Guyon et al. 1988; Jordan 1986; Reeves and Sperling 1986; Schreter and Pfeifer 1989; Seibert 1991; Seibert and Waxman 1990a,b; Wang and Arbib 1990). The present class of models, called STORE (Sustained Temporal Order REcurrent) models, exhibits properties that have heretofore not been available in a dynamically defined working memory. In particular, STORE working memories are designed to encode the invariant temporal order of sequential events, or items, that may be presented with widely differing growth rates, amplitudes, durations, and interstimulus intervals. The STORE model is also designed to enable all possible groupings of the events stored in STM to be stably learned and remembered in LTM, even as new events perturb the system. In other words, these working memories enable chunks (also called compressed, categorical, or unitized representations) of a stored list to be encoded in LTM in a manner that is not erased by the continuous barrage of new inputs to the working memory. Working memories with these properties are important in many applications wherein properties of behavioral self-organization are needed. Three important applications are real-time self-organization of codes for variable-rate speech perception, sensorimotor planning, and 3-D visual object recognition. Architectures for the first two types of application are described in Cohen et al. (1987) and Grossberg and Kuperstein (1989). Herein we outline how such a working memory can both simplify and extend the capabilities of the Seibert and Waxman model for 3-D visual object recognition (Seibert and Waxman 1990a,b; Seibert 1991).

2 Invariance Principle and Partial Normalization
The STORE neural network working memories are based on algebraically characterized working memories that were introduced by Grossberg (1978a,b). These algebraic working memories were designed to explain a variety of challenging psychological data concerning working memory storage and recall. In these models, individual events are stored in working memory in such a way that the pattern of STM activity across event representations encodes both the events that have occurred and the temporal order in which they have occurred. In the cognitive literature, such a working memory is often said to store both item information and order information (Healy 1975; Lee and Estes 1981; Ratcliff 1978). The models also include a mechanism for reading out events in the stored temporal order. An event sequence can hereby be performed from STM even if it is not yet incorporated through learning into LTM, much as a new telephone number can be repeated the first time that it is heard. The large data base on working memory shows that storage and performance of temporal order information from working memory are not always veridical (Atkinson and Shiffrin 1971; Baddeley 1986; Reeves and
Sperling 1986). These deviations from veridical temporal order in STM could be explained by the algebraic working memory model as consequences of two design principles that have clear adaptive value. These principles are called the Invariance Principle and Partial Normalization (Grossberg 1978b).
(2.1) At time f,, the pattern [~1(fi-~),x2(fi-~), . . . , ~ , - ~ ( t ~of- previously ~)] stored STM activities is multiplied by a common factor wi as the ith item is instated with some activity pi. The storage rule (2.1) satisfies the Invariance Principle for the following reason. Suppose that F1 is the first level of a two-level competitive
Figure 1: (a) Elementary STORE model: STM activity x_i at level 1 registers the item input I_i, nonspecific shunting inhibition x, and level 2 STM y_i. STM activity y_i at level 2 registers x_i. Complementary input-driven gain signals I and I′ control STM processing at levels 1 and 2. (b) Input I_i(t) equals 1 for t_i − α_i < t ≤ t_i. When all inputs are off (t_i < t ≤ t_i + β_i), level 2 variables y_k relax to level 1 values x_k(t_i).

learning network (Grossberg 1976). Then F1 sends signals to the second level F2 via an adaptive filter. The total input to the jth F2 node is Σ_k x_k z_{kj}, where z_{kj} denotes the LTM trace, or adaptive weight, in the path from the kth F1 node to the jth F2 node. In psychological terms, each active F2 node represents a chunk of the F1 activity pattern. When the jth F2 node is active, the LTM weights z_{kj} converge toward x_k; in other words,
the weight vector becomes parallel to the F1 activity vector. When a new item is added to the list, the Invariance Principle implies that the previously active items in the list will simply be multiplied by a common factor, thereby maintaining a constant ratio between the previously active items. Constant activity ratios imply that the former F1 activity vector remains parallel to its weight vector as its magnitude changes under new inputs. Hence, adding new list items does not invalidate the STM and LTM codes for sublists. In particular, the temporal order of items in each sublist, encoded as relative sizes of both the STM and the LTM variables, remains invariant.

2.2 Partial Normalization. The Partial Normalization rule algebraically instates the classical property of the limited capacity of STM (Atkinson and Shiffrin 1971). A convenient statement of this property is given by the equation
(2.2)

where θ_1 = 1 and θ_i decreases toward 0 as i increases. For example, let θ_i = θ^{i−1}, with 0 < θ < 1. Total activity S_i increases toward an asymptote, S, as new items are presented. Parameter S characterizes the "limited capacity" of STM. In human subjects, this parameter is determined by biological constraints. In an artificial neural network, parameter S can be set at any finite value. Using equations 2.1 and 2.2, it was proved in Grossberg (1978a) that the rate at which S_i approaches its asymptote S helps determine the form of the STM activity pattern. The pattern (x_1, …, x_i) can exhibit primacy (all x_{k−1} > x_k), recency (all x_{k−1} < x_k), or bowing, which combines primacy for early items with recency for later items (Grossberg 1978a). These various patterns correspond to properties of STM storage by human subjects. In particular, model parameters are typically set so that the STM activity pattern exhibits a primacy gradient in response to a short list. Since more active nodes are read out of STM before less active nodes during performance trials, primacy storage leads to the correct order of recall in response to a short list. Using the same parameters, the STM activity pattern exhibits a bow in response to longer lists, and approaches a recency gradient in response to still longer lists. An STM bow leads to performance of items near the list beginning and end before items near the list middle. A larger STM activity at a node also leads to a higher probability of recall from that node under circumstances when the network is perturbed by noise. An STM bow thus leads to earlier recall and to a higher probability of recall from items at the beginning and the end of a list. These formal network properties are also properties of data from a variety of experiments about working memory, such as free recall experiments during which human subjects are asked to recall list items after being exposed to them once in a prescribed order (Atkinson and
Shiffrin 1971; Healy 1975; Lee and Estes 1981). Effects of LTM on free recall data have also been analyzed by the theory (Grossberg 1978a,b). The multiplicative gating in equation 2.1 and the partial normalization in equation 2.2 are algebraic versions of the types of properties that are found in a general form in shunting competitive feedback networks (Grossberg 1973). A task of the present research was to discover specialized shunting networks that realize equations 2.1 and 2.2 as emergent properties of their real-time dynamics. The STORE model is a real-time shunting network, defined below, which exhibits the desired emergent properties. In particular, the STORE system moves from primacy to bowing to recency as a single model parameter is increased.
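As a concrete illustration of the algebraic storage rule 2.1 above, the following small Python sketch (our own illustration, with arbitrary constant values of ω and μ) shows that the ratios of previously stored activities, and hence the temporal order code, are unchanged by each new item:

    def store_item(pattern, omega, mu):
        # Rule 2.1: rescale all stored activities by the common factor
        # omega, then instate the new item with activity mu.
        return [omega * x for x in pattern] + [mu]

    pattern = []
    for i in range(5):
        pattern = store_item(pattern, omega=0.8, mu=0.5)
        print([round(x, 4) for x in pattern])

    # With omega < 1 earlier items are depressed (a recency gradient);
    # omega > 1 would amplify them (primacy).  Either way the ratio of
    # any two previously stored items, e.g. pattern[0] / pattern[1],
    # is unchanged by later items, so the F1 activity vector stays
    # parallel to any LTM weight vector learned for a sublist.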
3 Working Memories Invariant Under Variable Input Speed, Duration, and Interstimulus Interval

Two types of real-time working memories, transient models and sustained models, can realize the invariance and partial normalization properties. In a transient model, presentation of items of different durations can alter the previously stored pattern of temporal order information. Transient memory models can still accurately represent temporal order if input durations are controlled by a preprocessing stage. Sustained models allow input durations and interitem intervals to be essentially arbitrary: so long as these intervals are not too short, temporal input fluctuations have no effect on patterns stored in memory. A sustained neural network model is defined below. This two-level STORE model codes lists of distinct items. A variant of the STORE model design, to be discussed in a subsequent article, can encode the temporal order of lists in which each item may occur multiple times. Each item may also be represented by multiple nodes. The first level of the STORE model (Fig. 1a) consists of nodes with STM activity x_i. The ith item is assumed to send a unit input I_i to the ith node for a time interval of length α_i. After an interstimulus interval of length β_i, the next item sends an input to the (i + 1)st node, and so on. Each STM node also receives shunting inhibition via a nonspecific feedback signal that is proportional to the total STM activity x. The second STORE level consists of excitatory interneurons whose activity y_i tracks x_i. A critical additional factor in the model is gain control that enables changes in x_i to occur only when an input is present and enables changes in y_i to occur only when no input is present. This alternating gain control allows feedback from y_k to x_k (k < i) to preserve previously stored patterns even when a new input I_i is on for a long time interval. These processes are defined below in the simplest way possible to permit complete analysis and understanding of the model's emergent properties.
3.1 STORE model equations. The STORE model is defined by the dimensionless equations

\[
\frac{dx_i}{dt} = [A I_i + y_i - x_i x]\, I \tag{3.1}
\]

and

\[
\frac{dy_i}{dt} = [x_i - y_i]\, I' \tag{3.2}
\]

where

\[
x = \sum_k x_k \tag{3.3}
\]

\[
I = \sum_k I_k \tag{3.4}
\]

\[
I' = 1 - I \tag{3.5}
\]

and

\[
x_i(0) = y_i(0) = 0 \tag{3.6}
\]

The input sequence I_i is given by

\[
I_i(t) = \begin{cases} 1 & \text{if } t_i - \alpha_i < t \le t_i \\ 0 & \text{otherwise} \end{cases} \tag{3.7}
\]
(Fig. 1b). The input durations (α_i) and the interstimulus intervals (β_i = t_i − α_i − t_{i−1}) are assumed to be large relative to the dimensionless relaxation times of x_i and y_i, set equal to 1 in equations 3.1 and 3.2. Thus each x_i reaches steady state when inputs are on and each y_i reaches x_i when inputs are off. Otherwise, t_i and α_i can be arbitrary, and their values have no effect on patterns of memory storage.

3.2 Temporal order patterns. We will now examine how system properties vary as a function of the single free parameter, A, in equation 3.1. We will see that, in all cases, patterns of past activities remain invariant as new inputs perturb the system, and partial normalization obtains. In addition, the STM pattern (x_1, …, x_i) exhibits primacy for small A, recency for A > 1, and bowing for intermediate values of A, as follows. By equations 3.1, 3.6, and 3.7, when the ith input is presented, I_i = 1, y_i = 0, and
\[
x_i = \frac{A}{x} \tag{3.8}
\]

For k < i, I_k = 0 and

\[
x_k = \frac{y_k}{x} \tag{3.9}
\]
Thus the relative sizes of the activities in the pattern (x_1, …, x_{i−1}) are preserved when x_i becomes active. Amplitudes increase uniformly if total activity x < 1, and decrease uniformly if x > 1. Equations 3.1 and 3.2 imply that the variable x obeys the equations
\[
\frac{dx}{dt} = [A + y - x^2]\, I \tag{3.10}
\]

and

\[
\frac{dy}{dt} = [x - y]\, I' \tag{3.11}
\]

where

\[
y = \sum_k y_k \tag{3.12}
\]

Since y(0) = 0, equation 3.10 implies that x(t_1) = √A. At time t = t_i, i > 1, equation 3.10 implies that x(t_i) = √(A + y(t_i)), and equation 3.11 implies that y(t_i) = x(t_{i−1}). Thus the total activity S_i at time t_i satisfies S_1 = x(t_1) = √A and

\[
S_i \cong x(t_i) = \sqrt{A + S_{i-1}}, \qquad i > 1
\]

As the number of items increases, both S_i and x(t) approach

\[
S = \tfrac{1}{2}\left[1 + \sqrt{1 + 4A}\,\right]
\]

which is the positive solution of S = √(A + S). For large A, therefore, S_1 S^{−1} ≅ 1, whereas for small A,

\[
S_1 S^{-1} = \sqrt{A} \ll 1
\]
Thus, for large A, the total STM activity is approximately normalized at all times, whereas for small A, it grows rapidly as more inputs perturb the network. Since the size of parameter A in equation 3.1 reflects the degree to which the input I_i influences the STM pattern, recency for large A (present input dominates) and primacy for small A (past activities dominate) would be intuitively predicted. In fact, for large A, the pattern of STM activity (x_1, …, x_i) always shows a recency gradient. For small A, the STM patterns in response to short lists show a primacy gradient. Specifically, by equations 3.8 and 3.9,

\[
x_i(t_i) = \frac{A}{x(t_i)} \tag{3.13}
\]
and

\[
x_{i-1}(t_i) = \frac{x_{i-1}(t_{i-1})}{x(t_i)} \tag{3.14}
\]

Thus at time t_i, just after I_i has been presented,

\[
x_{i-1} > x_i \quad \text{iff} \quad x_{i-1}(t_{i-1}) > A \tag{3.15}
\]
Thus if x_1(t_1) > A, (x_1, …, x_i) shows a primacy gradient until x_i(t_i) ≤ A. Presenting additional inputs I_{i+1}, I_{i+2}, … then causes the STM pattern to bow. If x_1(t_1) ≤ A, the STM pattern always exhibits recency. Since x_1(t_1) = √A, recency occurs for all list lengths whenever A ≥ 1, while small A values allow relatively long lists to be stored by primacy gradients. The position at which the STM pattern bows can be calculated iteratively. For example, the bow occurs at position i = 2 if 1 > A ≥ .5(3 − √5) ≈ 0.382. These properties of the STORE model are illustrated by the computer simulations summarized in Figure 2, and by the numerical sketch below. Each row depicts STM storage of a list at a fixed value of A. In the left column, the STM vector (x_1, x_2, …, x_7) is depicted at times t_1, t_2, …, t_7 when successive inputs I_1, I_2, …, I_7 are stored. Each activity x_i is represented by the height of a vertical bar. The top row depicts a recency gradient, the seventh row a primacy gradient, and intermediate rows represent bows at each successive list position. The middle column graphs the ratios of the stored activities through time. The horizontal graphs mean that the Invariance Principle is obeyed as soon as both items in each ratio are stored. The third column graphs the growth of total activity x(t) to its capacity S. The input durations α_i in equation 3.7 varied randomly between 10 and 40. Such variations in input parameters had no discernible effect on the stored STM patterns.
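These simulations are easy to reproduce by direct numerical integration of equations 3.1 and 3.2. The following sketch is our own illustration (simple Euler integration; the durations, intervals, and step size are arbitrary values satisfying the stated constraints):

    def store_run(A, n_items=5, dur=20.0, isi=30.0, dt=0.01):
        # Integrate equations 3.1-3.2 for a list of n_items inputs.
        x = [0.0] * n_items
        y = [0.0] * n_items
        for i in range(n_items):
            for gain_on, T in ((True, dur), (False, isi)):
                for _ in range(int(T / dt)):
                    I = 1.0 if gain_on else 0.0   # I = sum_k I_k (eq. 3.4)
                    xt = sum(x)                   # x = sum_k x_k (eq. 3.3)
                    for k in range(n_items):
                        Ik = 1.0 if (gain_on and k == i) else 0.0
                        x[k] += dt * (A * Ik + y[k] - x[k] * xt) * I
                        y[k] += dt * (x[k] - y[k]) * (1.0 - I)
        return x

    for A in (2.0, 0.6, 0.05):
        print(A, [round(v, 3) for v in store_run(A)])
    # Large A gives a recency gradient, A = 0.6 a bow at position 2,
    # and small A a primacy gradient, as derived in equations 3.13-3.15.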
4 A Self-organizing Architecture for Invariant 3-D Visual Object Recognition

The application of the STORE model summarized below illustrates how a working memory, whose analog STM weights code both order and item information, can substantially reduce the number of connections needed to solve temporal learning problems, and simplify the modeling of such processes. Seibert and Waxman (1990a,b, 1992; Seibert 1991) have developed a novel self-organizing neural network architecture for invariant 3-D visual object recognition. In response to moving objects in space, an Invariant Preprocessor in the architecture automatically generates 2-D patterns that are invariant under changes in object position, size, and orientation, and are insensitive to foreshortening effects in 3-D. These patterns form the input vectors to an ART 2 network (Carpenter and Grossberg 1987) that self-organizes learned category representations of the invariant patterns. Each category node encodes a 2-D "aspect" of the object; that is, a single category node is activated by a collection of similar 2-D views of the object. The ART 2 vigilance parameter controls how
Figure 2: STORE model simulations for decreasing values of the input parameter A. The STM patterns [x_1(t_i), …, x_i(t_i)] show recency for large A, bowing for intermediate values of A, and primacy for small values of A. Total activity x(t_i) = S_i grows toward the asymptote S as i increases. When a new input I_i is stored, the previous pattern vector (x_1, …, x_{i−1}) is amplified if S_i = x(t_i) < 1, or depressed if S_i = x(t_i) > 1; but the pattern of relative activities is preserved. For these simulations, input durations α_i were varied randomly between 10 and 40, with the intervals (t_i − t_{i−1}) set equal to 50.
similar these 2-D views must be in order to activate the same category node J. Seibert and Waxman have successfully applied their system to the recognition of real 3-D objects. As the object moves with respect to the camera, a temporally ordered sequence J_1, J_2, …, J_n of 2-D category nodes is activated. These nodes and their transitions implicitly represent invariant 3-D properties of the object, in much the same manner as an "aspect graph" (Koenderink and van Doorn 1979). The Seibert and Waxman model learns to respond to temporal sequences of 2-D category activations with correct outputs from 3-D Object Nodes. To accomplish this, Seibert and Waxman modeled an Aspect Network that represents all the possible pairwise transitions between 2-D aspect nodes (Fig. 3a). The Aspect Network contains distinct locations N_{ij} at which sequential activations of nodes J_i and J_j are detected. The detection process at N_{ij} multiplies the activities x_i and x_j of the nodes J_i and J_j. As these activities wax and wane through time, a large product x_i x_j denotes that a transition has recently occurred between the 2-D aspect nodes J_i and J_j. The activation pattern across all the transition detectors N_{ij} forms the input to a competitive learning network (Fig. 3b). The output nodes of this network are called 3-D Object Nodes because the network learns to fire such a node only when an unambiguous sequence of 2-D aspect transitions is activated. An important feature of this model is its ability to recognize novel sequences composed of previously learned transitions. This approach to synthesizing 3-D recognition from combinations of distinct 2-D views is consistent with data of Perrett et al. (1987) about cells in temporal cortex that are sensitive to different 2-D views of a face. Despite its many appealing features, the Seibert and Waxman model could face two types of limitations if used in a more general context: proliferation of connections and sensitivity to input timing. As in all networks that explicitly compute pairwise or higher order correlations, proliferation of connections may occur using Aspect Graphs, although this problem did not occur in the application considered by Seibert and Waxman. In general, each different temporal order would use a different Aspect Network to compute products of the temporally overlapping STM traces of all successive input pairs at the spatial loci N_{ij} (Fig. 3a). In order to compute all possible objects that can be represented by M distinct (and nonrepeated) 2-D Aspect Nodes J_i, one needs to represent M! temporal orderings by M! Aspect Networks (Fig. 3b); for M = 10 aspects, for instance, M! ≈ 3.6 × 10⁶. Each Aspect Network computes O(M²) products, which require O(M) adaptive pathways to each 3-D Object Node. In our modified architecture, the M 2-D Aspect Nodes J_i are the item nodes of a STORE model. Thus both order and item information are represented by analog activation patterns across these M nodes. As a result, only one STORE model is needed with M nodes to represent all M! temporal orders, no Aspect Networks are needed, and only O(M)
Figure 3: The Aspect Network of Seibert and Waxman detects temporal order properties by computing the temporal overlap of pairs x_i and x_j of activities at distinct locations N_{ij} and then learning the pattern of overlapping traces to code transitions between 2-D aspects. (a) A single-object Aspect Network. (b) A complete multi-object Aspect Network in which each 2-D Aspect Node fans out to contact the Aspect Networks corresponding to all 3-D Object Nodes, which compete among themselves according to winner-take-all competitive learning rules. Reprinted with permission (Seibert 1991).
[Figure 4 diagram: INPUT IMAGES → INVARIANT PREPROCESSOR → (2-D ASPECTS) → (INVARIANT WORKING MEMORY) → (3-D OBJECTS) → OUTSTAR NETWORK → OUTPUT NAMES, with INPUT NAMES supplied to the Outstar Network.]
Figure 4: Processing stages of an ARTSTORE model for invariant 3-D object recognition.
adaptive pathways are needed from the STORE model to each 3-D Object Node (Fig. 4). This substantial reduction in the number of connections is complemented by invariant temporal order properties and a simpler learning law. The Seibert and Waxman computation of aspect transitions using products of successive STM traces is sensitive to changes in input duration and interstimulus interval. They partially compensate for variations
in input duration by using a specialized LTM law in their adaptive filters, whose adaptive weights converge to 1 if the corresponding product exceeds a threshold, and to zero otherwise (Fig. 3a). Such an approach cannot, however, compute the order of events separated by long interstimulus intervals β_i. The working memory representation of a STORE model automatically discounts variations in input durations and interstimulus intervals. Thus the invariant temporal order code of a STORE model can directly input to a standard ART 2 network, which can automatically learn to select different 3-D Object Nodes in response to different analog patterns of temporal order information over the same fixed set of working memory item nodes. Because the model in Figure 4 joins together ART and STORE models, it is called an ARTSTORE model. The ARTSTORE model also enables each 3-D Object Node to learn an arbitrary output pattern via outstar learning (Grossberg 1968, 1978b). To accomplish this, each 3-D Object Node is the source cell of an outstar (Fig. 4). All the outstars converge on the same outstar border, where an output name can be represented in an arbitrary format by an external teacher. Thus, in response to a 3-D object moving with respect to the Invariant Preprocessor, the architecture outputs an object name when enough information about the object's 2-D aspects and their temporal order has accumulated. The total self-organizing system uses the following cascade of processing stages: Invariant Preprocessor → ART 2 (2-D Aspects) → STORE model (Invariant Working Memory) → ART 2 (3-D Objects) → Outstar Network. This is a self-organizing multilevel instar-outstar map specialized for invariant 3-D object recognition (Carpenter and Grossberg 1991).
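In software terms, the cascade is a simple composition of modules. The sketch below is purely schematic; every interface shown is hypothetical, standing in for the dynamical networks described above, but it makes explicit where the STORE working memory sits between the two ART 2 classifiers:

    def artstore_classify(frames, preprocess, art2_aspects,
                          store_update, art2_objects):
        # frames: image sequence of one tracked object.  Each argument
        # is a callable standing in for one module of Figure 4.
        memory = []                          # STORE STM pattern
        for frame in frames:
            invariant = preprocess(frame)    # Invariant Preprocessor
            aspect = art2_aspects(invariant) # ART 2: 2-D aspect category
            memory = store_update(memory, aspect)  # invariant order code
        return art2_objects(memory)          # ART 2: 3-D object category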
5 Control of Working Memory and Temporal Learning

Reset of the working memory can be autonomously controlled by the object tracking system that Seibert and Waxman have incorporated into their Invariant Preprocessor. This system enables the architecture's camera to continuously track a moving object. As continuous tracking occurs, a sequence of 2-D aspects is learned and encoded in working memory, after which a ballistic camera movement focuses on a new object. We assume that working memory is reset, and thereby cleared, when a ballistic movement occurs; for example, by reducing the gain of the recurrent interactions between the variables y_i and x_i in the STORE model. As a result, each sequence of simultaneously stored 2-D aspects represents the same 3-D object with high likelihood. ART 2 learning of each working memory pattern may be controlled in either of two ways: (1) Unsupervised learning: Here each new entry into working memory causes ART 2 to choose and learn a new category. Each subsequence (J_1), (J_1, J_2), (J_1, J_2, J_3), … of 2-D aspect nodes can then learn to activate its own ART 2 node. Only those subsequences which
are associated with names of 3-D objects generate output predictions. (2) Supervised learning: Here an ART 2 learning gate is opened only when a teaching input to an outstar occurs. Consequently, only those sequences (J_1, J_2, …) that generate 3-D object predictions will learn to activate ART 2 categories and their outstar predictions. The number of learned ART 2 categories is hereby minimized. In either case, the ART 2 module can learn to select those combinations of item and order information that are predictive of an object by using its top-down expectation and vigilance properties (Carpenter and Grossberg 1987).

6 Concluding Remarks
The present model illustrates how a hierarchically organized neural architecture can self-organize a higher order type of invariant recognition by cascading together a combination of self-organizing modules, each of which computes a simpler invariant property. The Invariant Preprocessor computes a position/size/rotation invariant; the first ART 2 computes a self-calibrating similarity invariant of 2-D aspects; the STORE model computes a temporal order invariant; and the second ART 2 computes a self-calibrating similarity invariant of 3-D objects. In particular, the self-calibrating similarity invariant of 2-D aspects needs the temporal invariance of working memory to gain full effectiveness. This is so because the timing of individual outputs from the 2-D aspect nodes can depend in a complex way on the 3-D shape of an object and its relative motion with respect to the camera or other observer.
Acknowledgments

The authors wish to thank Kelly Dumont, Diana Meyers, and Carol Yananakis Jefferson for their valuable assistance in the preparation of the manuscript. G. B. was supported by DARPA (AFOSR 90-0083). G. A. C. was supported in part by British Petroleum (89-A-1204), DARPA (AFOSR 90-0083), and the National Science Foundation (NSF IRI 90-00530). S. G. was supported in part by the Air Force Office of Scientific Research (AFOSR 90-0128, AFOSR 90-0175), DARPA (AFOSR 90-0083), and the National Science Foundation (NSF IRI 90-24877). This is Technical Report CAS/CNS-TR-91-014, Boston University.
References

Atkinson, R. C., and Shiffrin, R. M. 1971. The control of short-term memory. Sci. Am. August, 82-90.

Baddeley, A. D. 1986. Working Memory. Clarendon Press, Oxford.
Carpenter, G. A., and Grossberg, S. 1987. ART 2: Self-organization of stable category recognition codes for analog input patterns. Appl. Opt. 26, 4919-4930.

Carpenter, G. A., and Grossberg, S. (eds.) 1991. Pattern Recognition by Self-Organizing Neural Networks. The MIT Press, Cambridge, MA.

Cohen, M. A., Grossberg, S., and Stork, D. 1987. Recent developments in a neural model of real-time speech analysis and synthesis. In Proceedings of the IEEE International Conference on Neural Networks, IV, San Diego, M. Caudill and C. Butler (eds.), pp. 443-454. IEEE, Piscataway, NJ.

Cohen, M. A., and Grossberg, S. 1987. Masking fields: A massively parallel neural architecture for learning, recognizing, and predicting multiple groupings of patterned data. Appl. Opt. 26, 1866-1891.

Elman, J. L. 1990. Finding structure in time. Cognitive Science 14, 179-211.

Grossberg, S. 1968. Some nonlinear networks capable of learning a spatial pattern of arbitrary complexity. Proc. Natl. Acad. Sci. U.S.A. 59, 368-372.

Grossberg, S. 1970. Some networks that can learn, remember, and reproduce any number of complicated space-time patterns, II. Studies Appl. Math. 49, 135-166.

Grossberg, S. 1973. Contour enhancement, short-term memory and constancies in reverberating neural networks. Studies Appl. Math. 52, 217-257.

Grossberg, S. 1976. Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biolog. Cybernet. 23, 121-134.

Grossberg, S. 1978a. Behavioral contrast in short-term memory: Serial binary memory models or parallel continuous memory models? J. Math. Psychol. 17, 199-219.

Grossberg, S. 1978b. A theory of human memory: Self-organization and performance of sensory-motor codes, maps, and plans. In Progress in Theoretical Biology, Vol. 5, R. Rosen and F. Snell (eds.), pp. 233-374. Academic Press, New York. Reprinted in Grossberg, S. (ed.) 1982. Studies of Mind and Brain. Reidel Press, Boston.

Grossberg, S., and Kuperstein, M. 1989. Neural Dynamics of Sensory-Motor Control. Pergamon, Elmsford, NY.

Grossberg, S., and Pepe, J. 1971. Spiking threshold and overarousal effects in serial learning. J. Stat. Phys. 3, 95-125.

Grossberg, S., and Stone, G. O. 1986. Neural dynamics of attention switching and temporal order information in short term memory. Memory Cog. 14, 451-468.

Gutfreund, H., and Mezard, M. 1988. Processing of temporal sequences in neural networks. Phys. Rev. Lett. 61, 235-238.

Guyon, I., Personnaz, L., Nadal, J. P., and Dreyfus, G. 1988. Storage and retrieval of complex sequences in neural networks. Phys. Rev. A 38, 6365-6372.

Healy, A. F. 1975. Separating item from order information in short-term memory. J. Verbal Learn. Verbal Behav. 13, 644-655.

Jordan, M. I. 1986. Serial order: A parallel distributed processing approach. Institute for Cognitive Science, Report 8604. University of California, San Diego.
Koenderink, J. J., and van Doorn, A. J. 1979. The internal representation of solid shape with respect to vision. Biol. Cybernet. 32, 211-216.

Lee, C., and Estes, W. K. 1981. Item and order information in short-term memory: Evidence for multilevel perturbation processes. J. Exp. Psychol.: Human Learn. Memory 1, 149-169.

Perrett, D. I., Mistlin, A. J., and Chitty, A. J. 1987. Visual neurones responsive to faces. Trends Neurosci. 10, 358-364.

Ratcliff, R. 1978. A theory of memory retrieval. Psychol. Rev. 85, 59-108.

Reeves, A., and Sperling, G. 1986. Attention gating in short-term visual memory. Psychol. Rev. 93, 180-206.

Schreter, Z., and Pfeifer, R. 1989. Short-term memory/long-term memory interactions in connectionist simulations of psychological experiments on list learning. In Neural Networks from Models to Applications, L. Personnaz and G. Dreyfus (eds.). I.D.S.E.T., Paris.

Seibert, M. C. 1991. Neural networks for machine vision: Learning three-dimensional object recognition. Ph.D. Thesis, Boston University.

Seibert, M. C., and Waxman, A. M. 1990a. Learning aspect graph representations from view sequences. In Advances in Neural Information Processing Systems 2, D. S. Touretzky (ed.), pp. 258-265. Morgan Kaufmann, San Mateo, CA.

Seibert, M. C., and Waxman, A. M. 1990b. Learning aspect graph representations of 3D objects in a neural network. In Proceedings of IJCNN-90, Washington, D.C., Vol. 2, M. Caudill (ed.), pp. 233-236. Erlbaum, Hillsdale, NJ.

Seibert, M. C., and Waxman, A. M. 1992. Learning and recognizing 3D objects from multiple views in a neural system. In Neural Networks for Perception, Vol. 1, H. Wechsler (ed.), pp. 426-444. Academic Press, New York.

Wang, D., and Arbib, M. A. 1990. Complex temporal sequence learning based on short-term memory. Proc. IEEE 78(9), 1536-1543.
Received 22 April 1991; accepted 2 October 1991.
ARTICLE
Communicated by Paul Bush
A Competitive Distribution Theory of Neocortical Dynamics

James A. Reggia, C. Lynne D'Autrechy, Granger G. Sutton III, Michael Weinrich
Departments of Computer Science and Neurology, A.V. Williams Bldg., University of Maryland, College Park, MD 20742 USA
Peristimulus inhibition in sensory pathways is generally attributed to lateral inhibitory connections. However, neocortical circuitry is incompletely understood at present, and in some cases there is an apparent mismatch between observed inhibitory effects and intracortical inhibitory connections. This paper studies the hypothesis that an additional mechanism, competitive distribution of activation, underlies some inhibitory effects in cortex. Analysis of a mathematical model based on this hypothesis predicts that peristimulus inhibitory effects can be caused by competitive distribution of activation, and computer simulations confirm these predictions by demonstrating Mexican Hat patterns of lateral interactions, transformation of initially diffuse activity patterns into tightly focused "islands" of activation, and edge enhancement. The amount of inhibition can be adjusted by varying the intensity of the underlying competitive process. The concept of competitive distribution of activation provides an important perspective for interpreting neocortical and thalamocortical circuitry and can serve as a guide for further morphological and physiological studies. For example, it provides an explanation for the existence of recurrent cortex-to-thalamus connections that perform a logical AND operation, and predicts the existence of analogous neocortical circuitry.

1 Peristimulus Inhibition
A general principle of sensory pathways in the central nervous system is the accompaniment of an excitatory stimulus by surrounding inhibitory effects. This can be observed, for example, in somatosensory and visual pathways and is reflected in the activity of neocortical cells (Mountcastle 1978), often being referred to as a "Mexican Hat pattern" of lateral interactions. In neocortex, direct activation (excitation) of a small region of cortex is known to produce peristimulus inhibition (Hess et al. 1975;
Gilbert 1985), and this fact has been used to account for the tendency of neocortical activity to form focused islands of activation (Kohonen 1984). Peristimulus inhibition in sensory pathways has long been attributed to lateral inhibitory connections. While there is some controversy about the neurophysiological mechanisms underlying peristimulus inhibitory effects in neocortex, it has also generally been assumed that they are solely attributable to lateral (horizontal) intracortical inhibitory connections. According to this view, when a neocortical site is active, it suppresses activation of nearby (peristimulus) cortex because of its direct or indirect lateral inhibitory synaptic connections to nearby cortex. This perspective has also dominated work in computational neuroscience: previous mathematical/computational models of neocortex have assumed that whenever inhibitory effects are observed, they are due solely to lateral inhibitory connections or excitation of laterally displaced inhibitory neurons (e.g., Kohonen 1984; Pearson et al. 1987; von der Malsburg 1973). At first glance this hypothesis appears quite plausible: horizontal inhibitory connections do occur in the neocortex (Jones and Hendry 1984; White 1989) and lateral inhibitory connections certainly produce lateral inhibition in prethalamic sensory pathways. In addition, to the authors' knowledge, no other specific mechanisms for lateral inhibitory effects in the neocortex have been proposed. At present, neocortical and thalamocortical circuitry is incompletely understood. For example, the purpose of retrograde corticothalamic (cortex-to-thalamus) connections, which actually outnumber thalamic afferents to the cortex by as much as a factor of ten, is not understood. As others have put it, "One of the most puzzling riddles in neurophysiology is the function of the massive projection from the cortex back to the thalamus" (Koch 1987). It seems clear that thalamic "relay nuclei" do substantial processing of the information they receive rather than just faithfully relaying it on to cortex (Sherman and Koch 1990). Similarly, intracortical circuitry is complex and only partially defined and understood at present (Douglas and Martin 1990; White 1989). Although important lateral inhibitory connections do exist in neocortex, inhibitory (GABAergic) neurons are in the minority and are themselves the recipients of numerous inhibitory synapses (Houser et al. 1984). Most putative inhibitory connections appear to be vertical and therefore intracolumnar (Somogyi et al. 1981). Those inhibitory connections that are lateral (e.g., basket cells), although potent, are relatively sparse and in some cases their distribution appears to be mismatched to the distribution of peristimulus inhibition. For example, it has long been known that electrophoretic application of glutamate, a putative excitatory neurotransmitter, to neocortex produces excitation of a small volume of tissue with surrounding inhibitory effects. The peristimulus inhibition is maximal at a distance of 100 to 200 μm from the glutamate stimulus, and lateral excitation, even of inhibitory cells, is not observed. As others have pointed out (Hess et al.
1975), these observations are puzzling and unexpected. This is because, among other things, following small intrinsic neocortical lesions the degenerating synaptic terminals over the region where maximal peristimulus inhibition occurs are overwhelmingly excitatory (asymmetric; Type I synapses) (Fisken et al. 1973, 1975). Degenerating terminals that are presumably inhibitory (symmetric; Type II) are present but are relatively sparse and occur mainly outside of the peristimulus inhibitory zone (Gatter et al. 1978). Thus, direct horizontal inhibitory connections as pictured in Figure 1a seem to be an unlikely mechanism for peristimulus inhibition in this situation. Further, peristimulus inhibition also apparently cannot be fully explained by indirect lateral inhibitory connections either (Hess et al. 1975), that is, by lateral excitation of intracolumnar inhibitory neurons (Fig. 1b,c). Excitatory horizontal connections in cortex generally terminate on excitatory (spiny) neurons (Gabbott et al. 1987; Kisvarday et al. 1986; LeVay 1988) and lateral excitation of inhibitory neurons is generally difficult to find (Hess et al. 1975), although recent studies have demonstrated its occurrence over longer (1 to 3 mm) distances (Hirsch and Gilbert 1991). While there is substantial uncertainty about the details of neocortical circuitry, the point being made here is simply that the mechanisms underlying peristimulus inhibition in cortex are not definitively established. The examples above indicate that there are some observations that are difficult to account for with the notion that such inhibition arises solely from direct/indirect horizontal inhibitory connections. This paper puts forth the competitive distribution hypothesis to explain how some peristimulus inhibitory effects in neocortex may arise from mechanisms other than inhibitory lateral connections. A model of neocortical dynamics based on the competitive distribution hypothesis is presented. Mathematical analysis of this model predicts that peristimulus inhibition will occur due to competitive distribution and indicates how the intensity of this inhibition can be adjusted. Computer simulations with the model confirm these predictions by demonstrating Mexican Hat patterns of interactions, formation of focused islands of activation from initially diffuse random activity, and edge enhancement effects, phenomena generally attributed exclusively to lateral inhibitory connections in the past. The concluding discussion of the paper describes how competitive distribution might relate to neocortical circuitry and might explain the existence of cortex-to-thalamus connections.

2 Competitive Distribution Hypothesis
How could peristimulus inhibition arise in the presence of lateral/horizontal connections that have purely excitatory effects (Fig. 1d)? To understand the competitive distribution hypothesis, it is convenient to view the cortex and thalamus abstractly as two-dimensional sheets (see Fig. 2).
Figure 1: Transverse sections through idealized cortex showing several cortical elements. Excitatory connections indicated by thick solid connectors, inhibitory by thin open-ended connectors. (a) Direct lateral inhibitory connections. Element j inhibits elements i via direct inhibitory synaptic connections. (b,c) Indirect lateral inhibitory connections. Element j inhibits elements i indirectly by excitation of inhibitory neurons (circles) in elements i. (d) Competitive distribution of activation. Each cortical element has only excitatory connections to excitatory neurons in neighboring elements. Elements such as those labeled k distribute their activation competitively as explained in the text. Element j thus appears to inhibit elements i because when j is successful in this competition the amount of activation elements k send to elements i diminishes. Inhibitory neurons exist but synapses on them and from them are contained within individual elements.
Figure 2: Generic sensory pathway via thalamic relay nucleus (e.g., lateral geniculate nucleus) to region of primary sensory neocortex. Neocortex and thalamus are viewed as two-dimensional sheets of volume elements. Each thalamic element x sends connections to a set of cortical elements (circular region) centered on corresponding cortical element x'.

Each sheet is divided into small volume elements, or just elements,¹ where each element represents a small set of spatially adjacent and functionally related neurons. An element in neocortex will represent a small patch of cortex of diameter less than 100 μm that extends through all layers and contains on the order of 100 neurons. This roughly corresponds to the concept of a cortical column. Each thalamic element sends connections to a neighborhood around the corresponding element/column in cortex. Each cortical element projects connections to nearby elements in cortex (not shown in Fig. 2).

¹Conceptually, the level of abstraction here is one level above that in most computational neuroscience models that simulate individual neurons and their synaptic connections. The atomic units in our model are not neurons and their connections, but volume elements representing sets of neurons, and connections representing resultant connection strengths composed of many individual synaptic connections.

In the following, it is critical to keep in mind the distinction between the phenomenon of inhibition, that is, the existence of an inhibitory
functional relationship, and the underlying mechanism by which that inhibitory relationship is brought about. We are considering two possible mechanisms for inhibitory effects here. First, when element j inhibits element i it may do so by direct or indirect inhibitory connections, as in Figure 1a-c. This corresponds to the widely held view of lateral interactions in neocortex. A second possible mechanism for inhibitory effects, which we propose here, is the competitive distribution of activation. Suppose that elements i and j both receive activation from element k, as in Figure 1d. Suppose that element k has a finite amount of activation to distribute at any moment in time and that elements i and j actively compete to receive that activation from element k. Then if element j is somehow successful in increasing its share of activation output from element k, this will decrease the activation being sent from k to i (since k has only a finite amount of activation to distribute). Thus, functionally, element j might be expected to appear to inhibit i even though it does not directly or indirectly have inhibitory connections to i, a behavior that we refer to as virtual lateral inhibition. The same concept can be applied to thalamocortical interactions. Suppose that a thalamic element sends connections to a group of cortical elements, which compete with one another for the activation from that thalamic element (Fig. 2). Then to the extent that one cortical element is successful in obtaining activation from the thalamic element, it will siphon thalamic output away from its competitors and thus give the appearance of inhibiting them. This situation is analogous to a number of familiar competitive processes, such as long-distance telephone companies competing for customers or animal populations competing for the same food supply. While competitive distribution may be a familiar concept in general, at first such a mechanism might appear biologically implausible in the neocortex. After all, individual neurons certainly do not behave in this fashion. However, while individual neurons may not exhibit competitive distribution of their activity, there is no a priori reason to assume that a set of neurons, such as those forming a volume element (column) in neocortex, do not. Accordingly, the following hypothesis can be made: Competitive distribution becomes an important factor in controlling the spread of activation at the level of the thalamus and cortex. More specifically, this mechanism can be proposed to act in two ways: 1. When a neocortical element is active, it competitively distributes its activation among neighboring cortical elements to which it has excitatory connections; and 2. when a thalamic relay nucleus element is active, it competitively distributes its activation among cortical elements to which it projects. The implication is that some peristimulus inhibitory effects at the level of the cortex are in part a result of one or both of these competitive processes rather than being solely due to lateral intracortical inhibitory connections. It is useful to be explicit about what the competitive distribution hypothesis does not say. It does not say that individual neurons
competitively distribute their output. Nor does it say that competitive distribution of activation is the only mechanism that brings about inhibitory effects in the cortex. For one thing, as described later, the competitive distribution hypothesis requires strong self-inhibitory effects in cortex for meaningful behavior. Self-inhibition, whereby an element has a recurrent inhibitory connection to itself, is entirely consistent with the fact that many inhibitory connections in cortex follow a vertical path (perpendicular to the pial surface) and are therefore intracolumnar. Further, the competitive distribution hypothesis does not say that competitive distribution of activation is the only mechanism causing peristimulus inhibition in the cortex. The occurrence of lateral inhibitory connections and competitive distribution of activation are not mutually exclusive, and lateral inhibitory connections are known to exist. One possibility is that these two mechanisms cooperate in various ways, for example, operating over different horizontal distances of cortex and thereby serving different functional roles. Finally, the competitive distribution hypothesis is unrelated to competitive synaptic alterations, a concept used in some previous neocortical models (Kohonen 1984; Pearson et al. 1987; von der Malsburg 1973). These past models involve changes to synaptic efficacy reflecting adaptation rather than the moment-to-moment distribution of cortical activity that is of concern here. The competitive distribution hypothesis does not involve changes to synapses at all, and competitive distribution of activation occurs on a time scale that is qualitatively faster than the synaptic alterations in previous work.
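To make the arithmetic of virtual lateral inhibition concrete, the following small numerical sketch (Python; the function name and the numbers are illustrative assumptions, not part of the model specified below) shows how a fixed output, divided in proportion to recipient activation levels, produces an apparent inhibitory interaction:

    # Element k has a fixed amount of output activation to distribute,
    # split among its recipients in proportion to their activation levels.
    def shares(total_output, recipient_activations):
        s = sum(recipient_activations)
        return [total_output * a / s for a in recipient_activations]

    # Two recipients i and j. Raising j's activation lowers what i
    # receives from k, so j appears to inhibit i even though no
    # inhibitory connection exists anywhere.
    print(shares(1.0, [0.5, 0.5]))  # i receives 0.50
    print(shares(1.0, [0.5, 1.5]))  # i receives 0.25

Here element i's input from k drops from 0.50 to 0.25 solely because j became more active, which is the sense in which j "inhibits" i.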
3 A Specific Formulation

Could inhibitory effects observed in cortex be due in part to the competitive distribution of activation rather than, as generally accepted, entirely due to lateral inhibitory connections? A first step in answering this question is to develop a model to demonstrate the basic principles involved. Accordingly, an abstract model of cortex and thalamocortical interactions based on the competitive distribution hypothesis is now described. The model abstracts away from detailed biophysical mechanisms at the level of individual neurons and synapses by having atomic units which represent small volumes of brain tissue, and as such can be classified as a "simplifying brain model" (Sejnowski et al. 1988). The model cortex described here is constructed in such a way that it can potentially incorporate both lateral inhibitory connections and competitive distribution of activation, either together or alone. By suitable selection of parameter values, two special versions of the general model are created. The first version, called Model I for "inhibitory connections," uses only lateral intracortical inhibitory connections to bring about peristimulus inhibition. The second version, called Model C for "competitive distribution," does not have any lateral intracortical inhibitory connections.
Model C has only excitatory connections and elements that distribute their activation competitively, so it permits one to examine the effects of competitive distribution in isolation. Model C represents an "experiment" while Model I serves as a "control." Developing and focusing on two "pure" models I and C as variants of the same general framework, rather than two separate and independent models, facilitates a clean comparison of the two mechanisms of inhibitory effects. We first describe the general model, and then specify how this general model is altered to derive the special cases Model I and Model C.

3.1 Network Structure. The overall network structure of the general model consists of two two-dimensional sheets of volume elements representing cortex and a thalamic relay nucleus (Fig. 2). Each cortical element, for example, can be viewed as a column of 100 μm or less in diameter, containing 100 to a few hundred neurons. Sensory afferents enter the thalamic layer, and thalamic elements then project in a divergent but topology-preserving fashion onto cortex. Figure 3 shows small patches of simulated cortex that illustrate its tessellation into small hexagonal volume elements. The distance between any two volume elements is measured as the minimal number of element boundaries that must be crossed to move between the two elements. Define N_i(r) to be the set of elements equidistant from element i at a distance or radius r. For example, N_i(1) consists of the six elements at r = 1 that are contiguous to element i, N_i(2) consists of the ring of 12 next-nearest elements at r = 2, and so forth. Each volume element in cortex sends connections to nearby elements in cortex. Just as a volume element represents many neurons, a single connection in this model represents multiple synaptic interactions. The distribution of intracortical connections is illustrated for a single representative element i in Figure 3a. This element sends excitatory connections to the six contiguous elements in N_i(1), and inhibitory connections to the 12 next-nearest elements in N_i(2). The width of these excitatory and inhibitory annuli is not critical to what follows; we select a width of 1 here for both excitatory region and inhibitory penumbra as the simplest and most computationally tractable case that illustrates the principles at issue.² Finally, each node has a self-connection that is always inhibitory. The connection strength on the connection to element i from element j at time t is designated as c_ij(t), where c_ij(t) = 0 if j does not send a connection to i. Certain gain restrictions apply, as follows. For elements k in N_j(1) contiguous to element j we have c_kj(t) > 0 and the restriction Σ_{k∈N_j(1)} c_kj(t) = c_p, where c_p > 0 is a network-wide constant referred to as the positive or excitatory gain.

²Real lateral intracortical connections extend over a much wider region. If the model is altered to have broader connections then the sums in the gain restrictions must be extended to be over these wider excitatory and inhibitory regions.
Figure 3: Small patches of neocortex. (a) Intracortical connectivity. A "+" indicates excitatory connection; a "-" indicates inhibitory connection. Each element i sends an excitatory connection to its six adjacent elements and an inhibitory connection to the next-nearest 12 elements. (When competitive distribution is used by cortical elements, the inhibitory connections are absent.) (b) Thalamocortical connectivity. The thalamic element directly under the center cortical element here sends excitatory connections to the cortical region of diameter seven pictured here. These connections are more strongly weighted near the center of the region as explained in the text. (c,d,e) Cortical elements labeled j form the set N_ik(1) = N_i(1) ∩ N_k(1). (c) r = 1; (d,e) r = 2.

Similarly, for next-nearest elements k in N_j(2) we have c_kj(t) < 0 and the restriction Σ_{k∈N_j(2)} c_kj(t) = c_n, where c_n < 0 is a network-wide constant referred to as the negative or inhibitory gain. Finally, each element j has an inhibitory connection to itself with constant strength c_jj(t) = c_s, with c_s < 0. Gain parameters c_p, c_n, and c_s
characterize the model cortex in important ways (Reggia and Edwards 1990). Thalamic elements are arranged in a two-dimensional hexagonal array identical to that of cortex. Since our primary interest here is in thalamocortical interactions and not in thalamus, no explicit intrathalamic connections are modeled. Each thalamic element simply receives a direct input connection representing sensory input. Each thalamic element sends a cluster of excitatory connections to a neighborhood of cortical elements of diameter seven centered on the topologically corresponding cortical element (Fig. 3b). These 37 connections are weighted so as to be maximal/densest in the center and weaker as one moves away from the center, as indicated schematically in Figure 3b. For thalamic element k and cortical element l, connection strength c_lk(t) > 0, so all thalamocortical connections are excitatory. We require Σ_{l∈C_k} c_lk(t) = c_t, where c_t > 0 is the thalamocortical gain and C_k is the set of all cortical elements to which thalamic element k sends connections.

3.2 Activation Rule. Each cortical element i has a nonnegative activation level a_i(t) representing the mean firing rate of the neurons it contains. At any time t, element i sends activation of the amount c_ki a_i to each element k to which it is directly connected. Simultaneously, element i is receiving input activation³ in_i from neighboring elements given by
in_i = Σ_{j∈N_i(1)} c_ij a_j + Σ_{j∈N_i(2)} c_ij a_j + Σ_{j∈T_i} c_ij a_j + e_i + b_i        (3.1)

where b_i > 0 is a constant bias, e_i represents external input (e.g., electrophoretic application of glutamate), and T_i is the set of thalamic elements connected to cortical element i. The first three terms here represent excitatory input from contiguous elements, inhibitory input from next-nearest elements, and thalamic excitatory input, respectively. The rate of change in element i's activation level is

da_i/dt = in_i (M - a_i) + c_s a_i        (3.2)

where M > 0 is a cortex-wide constant. The term c_s a_i represents a self-inhibitory connection (recall c_s < 0). Setting da_i/dt = 0 in equation 3.2 gives in_i(M - a_i) = -c_s a_i, so at equilibrium a_i = M/(1 - c_s/in_i), where M represents the maximum possible activation level of an element. By placing a floor of a_i = 0, an element's activation level is thus always maintained in the range from 0 to M if all elements start in that range. Thalamic elements obey the same equations 3.1 and 3.2 concerning updating of their activation levels. However, thalamic elements are of secondary concern here, so the model of their dynamics is especially simplified. In particular, the right-hand side of equation 3.1 reduces to e_i + b_i, and parameter values b_i = 0 for all i, c_s = -0.1, and M = 1.0 are used, giving da_i/dt = e_i(1 - a_i) - 0.1 a_i for thalamic elements.

³Here and in the following the parameter t is omitted from equations for brevity.
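As a concrete illustration, equations 3.1 and 3.2 can be implemented in a few lines. The sketch below (Python with NumPy) is a minimal rendering under stated assumptions: activation levels are stored in arrays, the connection strengths c_ij are held in matrices that are zero wherever no connection exists, and the hexagonal neighborhood structure is assumed to be already encoded in those matrices; the function names are ours, not the authors':

    import numpy as np

    def input_activation(a, a_thal, C_near, C_next, C_thal, e, b):
        # Equation 3.1: in_i is the connection-weighted sum of activation
        # from contiguous elements N_i(1), next-nearest elements N_i(2),
        # and thalamic elements T_i, plus external input e_i and bias b_i.
        return C_near @ a + C_next @ a + C_thal @ a_thal + e + b

    def euler_step(a, in_a, M, c_s, dt):
        # Equation 3.2: da_i/dt = in_i (M - a_i) + c_s a_i, with c_s < 0
        # acting as self-inhibition; the floor at zero keeps each
        # activation level in the range [0, M].
        return np.clip(a + dt * (in_a * (M - a) + c_s * a), 0.0, M)

For Model I the matrices C_near and C_next are constant; for Model C, as described in the next section, the excitatory strengths must be recomputed at every step from the current activation levels.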
4 Models I and C
We now create two specific versions of the general model presented in the previous section and refer to them as Models I and C. Model I ("inhibitory connections") restricts connection strengths c_ij to be fixed in value and has both inhibitory and excitatory connections. Restricting connection strengths to be "fixed" does not exclude the possibility of slow changes that might occur during adaptation; such slow changes during learning occur on a different time scale than that of concern here and can be ignored for our current purposes. Model C ("competitive distribution") has no inhibitory connections (c_n = 0), but the remaining excitatory connections have strengths that vary with time to bring about the competitive distribution of each element's output. Model C is thus our "experiment" while Model I is our "control." Since Models I and C are special cases of the same general framework, they are identical except for the inclusion of fixed-strength inhibitory connections in the former versus the exclusive use of excitatory but time-varying connections in the latter.

More specifically, for Model I all connection strengths are constant. Each cortical element k ∈ N_i(1) adjacent to cortical element i receives an excitatory connection of strength c_ki = c_p/6 from element i, where c_p > 0. Since inhibitory connections extend out only one unit of distance farther than excitatory ones, for k ∈ N_i(2) the connection strength is c_ki = c_n/12, where c_n < 0. Since element i has 6 contiguous elements at r = 1 and 12 connected elements at r = 2, it follows that in either case the gain restrictions are satisfied. Thalamocortical connections are exclusively excitatory and also have fixed connection strengths. For thalamic element j and cortical element k, c_kj(t) = w_kj, where constant w_kj depends on the distance r of element k from the cortical element that is topologically equivalent to thalamic element j. Specifically, w_kj is determined by a bell-shaped (normal, gaussian) function (1/(s√(2π))) e^(-(1/2)(r/s)²) with parameter s. For example, with s = 2.0 the thalamocortical connection strengths w_kj are approximately 0.199, 0.176, 0.121, and 0.065 for r = 0, 1, 2, and 3, respectively.

In contrast to Model I, Model C is created from the general cortical model by eliminating all inhibitory lateral connections in cortex (c_n = 0). The only remaining inhibitory influences are self-inhibitory intraelement connections of fixed strength c_ii = c_s, where c_s < 0. Now, however, the remaining intracortical excitatory connections have time-varying strengths, representing the first part of the competitive distribution hypothesis. For each cortical element k ∈ N_i(1) contiguous to cortical element i,

c_ki(t) = c_p (a_k + q)^v / Σ_{m∈N_i(1)} (a_m + q)^v        (4.1)
where v > 0 and q > 0 are network-wide constants. We assume v = 1 unless explicitly noted otherwise. Although c_ki > 0 and the gain restrictions
are satisfied, these dynamically varying excitatory connections differ substantially from the static excitatory connections in Model I. For example, all cortical interelement connection strengths in Model I are symmetric (c_ki = c_ik), whereas that is clearly not the case in general for Model C. Equation 4.1 provides for the competitive distribution of activation (Reggia et al. 1991). The sum in the denominator assures that the total output of element i is c_p a_i. A fraction of this output goes to each adjacent element k in proportion to its activation level. Thalamocortical connections are again exclusively excitatory and also have time-varying connection strengths, representing the second part of the competitive distribution hypothesis. For thalamic element j and cortical element k,

c_kj(t) = c_t w_kj (a_k + q)^v / Σ_{m∈C_j} w_mj (a_m + q)^v        (4.2)
where C_j is the set of cortical elements to which thalamic element j sends connections, and the w_mj are constant weights⁴ whose values are determined in the same way as for Model I. Again v = 1 is assumed unless explicitly noted otherwise. Thus Model C has both cortical and thalamic elements that distribute their output activation competitively. An important point made explicit in equations 4.1 and 4.2 is that elements that distribute their activation competitively imply the existence of retrograde influences. For example, thalamic element j determines its output c_kj a_j to cortical element k as a function of a_k (see numerator in equation 4.2). Thus, the actual neural synaptic connections implementing the functional thalamocortical connections in Figure 3b would have to include axons from cortical neurons back to thalamic neurons (see Discussion).

⁴Equation 4.2 becomes almost identical to equation 4.1 when all connections have identical weights.
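A minimal sketch of the competitive rule of equation 4.1 follows (Python; the function name and the sample values are illustrative assumptions). It makes explicit that the gain restriction is preserved by construction:

    import numpy as np

    def competitive_weights(a_neighbors, c_p, q=0.01, v=1.0):
        # Element i's outgoing strengths c_ki to its neighbors k: a share
        # of the excitatory gain c_p proportional to (a_k + q)^v.
        w = (np.asarray(a_neighbors, dtype=float) + q) ** v
        return c_p * w / w.sum()

    # The strengths always sum to c_p, so element i's total output is
    # c_p * a_i no matter how its neighbors' activations are distributed.
    c = competitive_weights([0.0, 0.2, 0.9, 0.1, 0.0, 0.3], c_p=0.6)
    assert abs(c.sum() - 0.6) < 1e-12

The thalamocortical rule of equation 4.2 is the same computation with each term additionally multiplied by the constant weight w_mj.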
5 Prediction of Interelement Relationships

Model I resembles previous simulations of cortex in its use of lateral intracortical inhibitory connections to bring about peristimulus inhibition. Model C represents the hypothesis that the same behavior can be achieved through the use of competitive distribution of activation. In this section we show analytically that, under appropriate conditions, Model C will in fact produce peristimulus inhibitory effects, and contrast these effects with those seen with Model I. Cortical elements in both Model I and Model C are governed by an activation mechanism of the form da_i/dt = f_i(a), where a represents the current state of the network's activation levels, and where f_i = in_i(M - a_i) + c_s a_i according to equation 3.2. In mathematical systems of this form, it is customary to say that an element k directly inhibits another element i if ∂f_i/∂a_k < 0, that an element k directly excites another element i if
∂f_i/∂a_k > 0, and that element k is neutral with respect to element i if neither of these conditions holds⁵ (Freedman 1980; Grossberg 1980; Hirsch 1989; Lotka 1924). These definitions describe the direct functional relation between two elements but say nothing about the underlying mechanism that brings about that relationship. Further, they are limited to capturing only the direct relationship between two elements. They do not reflect indirect relationships, which certainly occur in a recurrently connected network like that modeled here. By equation 3.2 with k ≠ i we have

∂f_i/∂a_k = (M - a_i) ∂in_i/∂a_k        (5.1)
Since (M - a_i) ≥ 0, the sign of ∂in_i/∂a_k determines whether element k is inhibitory, excitatory, or neutral with respect to element i, and for a given a_i the magnitude of ∂in_i/∂a_k determines the strength of this relationship. In examining equation 5.1 in the following, we initially ignore thalamic influences on cortex, assuming c_t = 0. In the final part of this section this situation is reversed and thalamic influences are considered while intracortical lateral connections are ignored.
where nl = INi(l)l = 6 and n2 = INi(2)(= 12. Consider element i to be located at progressively increasing distance r from element k. Then, noting that daj/dak = 0 unless j = k, it follows that for r = 1, dini/dak = cp/nl > 0, while for r = 2, dini/& = c,/ii~ < 0, and for r 1 3, dinildak = 0. Thus, in Model I, any cortical element k directly excites its contiguous elements Ni(l), directZy inhibits its next-nearest elements Ni(2),and exerts no direct influence on more distant elements. These relations are static. This Mexican Hat pattern of interactions is exactly as expected and has been observed in previous models. The term static is used here to indicate that these relations between i and k do not change as a function of neighboring activation levels (dinildak is constant in time). 5.2 Model C. As noted earlier, we ignore thalamic influences on cortex initially, but return to this issue in the final part of this section. In contrast to Model I, with Model C connection strengths are not constant ~~
'Such systems are often said to exhibit "competitive" ("cooperative") relationships rather than inhibitory (excitatory)relationships. We use the latter terms to be more consistent with neurobiological terminology and to avoid confusion'with the term "competitive distribution of activation."
300
James A. Reggia et al.
and cn = 0. In this case, for any two cortical elements i and k separated by a distance Y = 1 it can be shown (see I, Appendix; ZI = 1) that (5.3) where N,k(l) = N , ( l ) n N k ( l ) the , two elements adjacent to both i and k (see Fig. 3c). This result is quite remarkable. Since the individual quantities (Ck,cP,u,, etc.) on the right side of equation 5.3 are nonnegative, it follows that din,/auk can be positive, negative, or zero. Thus, while element k may directly excite adjacent element i as expected, it may also be neutral with respect to i or even directly inhibit element i. Further, unlike with Model I, the relationship between elements i and k is dynamic: exactly which possible relationship holds at any moment is a function of the pattern of activation that is present. That element k may at times directly inhibit element i may seem surprising, given the positively weighted connection between them. However, some reflection shows that this is intuitively plausible. Suppose first that all elements have a zero activation level except for adjacent elements i and k, which are nonzero, and let q be near zero. In this case, ain,/dak = cp (see 111, Appendix), so as expected element k directly excites element i. On the other hand, suppose one of the elements j that is adjacent to both i and k (Fig. 3c) is also nonzero. As ul is increased, everything else being equal, din,/auk decreases until for a sufficiently large value of a,, ain,/duk becomes negative, and k has an inhibitory relation with i (see 111, Appendix). In this situation where a, is large so that it contributes a significant amount of input to element i, element k diverts some of j’s output away from i. In this case, element k has an inhibitory influence on i because it competes with i for j’s output. Thus, elements adjacent to each other can directly excite or inhibit one another depending on the pattern of activation that is present. Finally, consider the situation where all elements have the same activation level uo. In this situation ain,/dak > 0 always (see 111, Appendix), so immediately adjacent elements always have an excitatory relationship. Such a situation occurs, for example, in a network with elements that are all at a natural resting value. Now consider elements i at a distance r = 2 from element k. In contrast to Model I, there are no connections between k and these elements, so one might anticipate a neutral relationship. Instead, it can be shown (see 11, Appendix) that (5.4) where as before, the summation is over elements j contiguous to both i and k (Fig. 3d,e). Note that equation 5.4 is identical to equation 5.3 except there is no c;k term, reflecting the absence of a direct connection between k
Competitive Distribution Theory of Neocortical Dynamics
301
and i. According to equation 5.4, an element k always inhibits elements i at a distance r = 2, although the amount of inhibition varies as a function of local activation levels. This result can be understood intuitively if one recognizes that elements i and k both compete for activation from any mutually adjacent node j . Thus, an increase in k's activation increases the proportion of j's activation going to k and, accordingly, decreases the proportion going to i. Finally, for r 2 3 there are no elements adjacent to both i and k. In this case, as for Model I, dini/dak = 0, and k is neutral with respect to i. We thus have the following result: In Model C, any cortical element k directly excites its contiguous elements Ni( l),although in some contexts this relationship may become inhibitory. Element kdirectly inhibits its next-nearest elements Ni(2), and exerts no direct influence on more distant elements. These direct relationships are dynamic. As with Model I, a Mexican Hat pattern of interactions is anticipated, in spite of the fact that no lateral inhibitory connections are present in Model C. The term dynamic is used here to indicate that the relations between i and k change as a function of neighboring activation levels (dinildak is not constant). In this context it is interesting to note that if q is very large ( q >> M ) , at r = 1 the value dini/dak = cp/nl and at r = 2 the value 8ini/dak M 0. In this case Model C becomes like Model I but without lateral intracortical inhibitory connections or effects ( c i k values become constant), and peristimulus inhibition is abolished.
5.3 Thalamic Influences. So far in this analysis thalamic influences on cortex have been ignored (c, = 0 assumed). This situation is now reversed to assess thalamic influences on cortex. In the following, we assume that ct > 0 but that intracortical lateral connections are removed (cp = 0, c, = 0 ) , and address in isolation the effects that thalamic elements have on cortex. Let k be a thalamic element that sends a connection to cortical element i. Then for Model I, dinildak = wik > 0, and for Model C, dinildak = cik 1 0, so in both models thalamic elements have a direct excitatory relation with the cortical elements to which they project (see IV, Appendix). This relationship is static in Model I and dynamic in Model C. On the other hand, in both models if thalamic element k does not send a connection to cortical element i then dinildak = 0, and element k is neutral with respect to element i. Now reconsider the relationships between two cortical elements i and k under the same assumptions that lateral intracortical connection strengths are negligible (cp = c, = 0) while thalamocortical connection strengths are not (ct > 0). For Model I (see V, Appendix), as would be expected in the absence of lateral intracortical connections, dinildak = 0, so any two cortical elements i and k have a neutral relationship? On the other hand, for Model C the situation is more interesting because thalamic 6Thiswould be true even if corticothalamic connections existed in Model I (they do not) since our definitions measure only direct excitatory/inhibitory relationships.
302
JamesA. Reggia et al.
elements are assumed to distribute their activation competitively. In this situation two cortical elements i and k receiving connections from a common thalamic element have an inhibitory relationship because
(5.5) where Tik is the set of thalamic elements that send a connection to both cortical elements i and k (see VI, Appendix). This quantity is always nonpositive. Thus, even in the absence of lateral intracortical connections two cortical elements have an inhibitory relationship if they receive connections from common thalamic elements. This makes sense intuitively: since such a thalamic element j competitively distributes its output activation to cortical elements i and k, an increase in ak would tend to diminish the input received by element i, and vice versa. 6 Simulation Results
The analysis above makes the prediction that competitive distribution of activation will produce peristimulus inhibition similar to that observed in cortex. However, this analysis accounts only for direct interactions between neural elements; it says nothing about the substantial indirect interactions (e.g., disinhibition) that can occur in recurrently connected networks. Computer simulations were therefore undertaken to verify that competitive distribution of activation produces peristimulus inhibition in Model C as predicted, to demonstrate that this phenomenon is robust to variations in the details of Model C, to systematically compare the cortical activation patterns obtained in Model C (experiment) versus those obtained with Model I (control), and to assess the relative contribution in Model C of intracortical versus thalamocortical competitive distribution of activation in producing peristimulus inhibition. Over 700 simulations have been run with variations in activation rules, network structure, network parameters, and input patterns. The results of a representative subset of these simulations are summarized in this section.⁷ All simulations had networks with opposite edges connected together to avoid edge effects. The number of connections involved varied with network size and model details. For example, for a simulation with Model I having 35 x 32 cortical elements and connections from each cortical element to its neighbors up to a distance of r = 3, there are 40,320 intracortical connections, plus 37 additional thalamocortical connections for each thalamic element.

⁷Simulations used an Euler method with double-precision arithmetic in a general-purpose neural modeling system (D'Autrechy et al. 1988) under UNIX on SUN 3 or VAX-class machines.

The bias b_i of all cortical elements was always the same value b ≥ 0 in any given simulation; b determines the natural
resting value to which element activation levels decay in the absence of external input (see equation 3.1). All simulations began with all elements in their natural resting state. Following the onset (or change) of external input e_i, a simulation was said to have reached equilibrium when all element activation levels changed by less than 0.0005 between two consecutive time steps. In the following, the term point stimulus refers to an input pattern in which one unit of external input is applied to a single model element. Peristimulus inhibition is demonstrated clearly and consistently in different versions of Model C by the cortical patterns of activation observed at equilibrium following thalamic stimuli, and its occurrence is relatively insensitive to limited variations of c_p, c_s, and c_t. A point stimulus in which a single thalamic element is persistently activated results in a classic Mexican Hat pattern of activation in cortex that is best seen when all cortical elements have small but nonzero activation levels (Fig. 4). Similar but broader patterns with more intense inhibitory effects are observed for small "spots" of thalamic activation, such as activation of seven contiguous thalamic elements arranged in a hexagon. When all cortical elements have a zero resting activation level, cortical peristimulus inhibition can be identified by the evolution of an initially broad activation pattern into a smaller, more intense "island" of activation as shown in Figure 5. This phenomenon is often attributed to lateral inhibitory connections (Kohonen 1984). As a result of this effect, a diffuse, random pattern of activation across the thalamus produces circumscribed islands of cortical activation separated by inactive regions. This pattern of idealized cortex activation is reminiscent of patterns of electrical and metabolic (2-deoxyglucose) activity seen in neocortex (Juliano et al. 1981). Another way to demonstrate peristimulus inhibition in Model C is to show that cortical elements activated by thalamic afferents are inhibited by simulated nearby electrophoretic application of an excitatory neurotransmitter such as glutamate. For example, Figure 6 illustrates the response of a cortical element to a fixed thalamic stimulus as the direct application of glutamate to progressively more distant neighboring cortical elements occurs (compare to Hess et al. 1975). Particularly striking is the fact that stimulation of a cortical element inhibits its immediate neighbor (r = 1), even though there is an excitatory connection between the two (see Fig. 3a). These results qualitatively reproduce those seen biologically where inhibitory effects appear between elements having predominantly excitatory synaptic connections (Hess et al. 1975). When activation patterns consisting of large adjacent regions of activation and inactivation (but no enhancement of borders) are input to the thalamus in Model C, border enhancement appears in the resulting activation patterns in the cortex (see Fig. 7). If the border between active and inactive regions is less abrupt (i.e., has a gradual slope), this phenomenon persists but is diminished. These results demonstrate that competitive distribution of activation can cause or sharpen border enhancement just as can lateral inhibitory connections.
[Figure 4 here: plot of Activation Level versus Cortical Distance.]
Figure 4: Cross section through "Mexican Hat" pattern of cortical activation occurring in response to activation of a single thalamic element in Model C. Peristimulus inhibition is clearly evident at a distance r = 2 and r = 3 from stimulus. All cortical elements in this simulation have a natural resting activation level of 0.1, indicated by the horizontal dotted line. M = 3, s = 1, c_p = 0.6, c_n = 0.0, c_s = -2.0, time step δ = 0.5, b = 0.009.

Thus, border enhancement is attributable to the phenomenon of lateral inhibition in general, and does not require that the mechanism of that inhibition be lateral inhibitory connections as is often supposed. Competitive distribution of activation occurs in two independent ways in Model C: via intracortical mechanisms and/or via thalamocortical mechanisms. However, there is no a priori reason to require that competitive distribution manifest itself in both ways: the analysis above shows that peristimulus inhibition would be expected with either mechanism alone. Computer simulations, done in a variety of ways, confirm this expectation. For example, one way to examine intracortical competitive distribution in isolation is to directly activate cortical elements. This can be done in Model C by supplying external input e_i to cortical elements (see equation 3.1), simulating the direct application of an excitatory stimulus to element i.
[Figure 5 here: plot of cortical activation versus Cortical Distance.]
Figure 5: Response of Model C when the natural resting activation level of cortical elements is zero (same stimulus, parameters as Fig. 4 except s = 2, b = 0). The pattern of cortical activation evolves over time from flat and broad (1) to peaked and focused (5).
In this situation the competitive distribution of activation by thalamocortical elements has no influence on the cortex because all thalamic elements have zero activation. When direct activation of cortical elements is done in this fashion with the same stimulus patterns described above, qualitatively similar activation patterns are observed in cortex, except that the cortical response patterns are more focused with sharper borders and higher peaks. Similarly, one way to examine thalamocortical competitive distribution of activation in isolation is to remove intracortical lateral connections in Model C (c_p = 0). Presenting the same input patterns to thalamus results in patterns of cortical activation qualitatively similar to those described above that exhibit peristimulus inhibition. The primary difference is that now the overall amplitude of cortical responses is substantially smaller because of the absence of horizontal excitatory connections. In summary, it is not necessary that both cortical elements and thalamic elements competitively distribute their activation for peristimulus inhibition to occur; either alone is sufficient.
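The simulation protocol used throughout this section (a persistent stimulus followed by iteration to equilibrium, defined as no activation level changing by more than 0.0005 between consecutive time steps) can be sketched as a simple driver loop. The stopping tolerance below is the value quoted in the text, while the function signature, the step callback, and the iteration cap are illustrative assumptions:

    import numpy as np

    def run_to_equilibrium(a, step, tol=0.0005, max_iter=100000):
        # Iterate the activation dynamics until every element changes by
        # less than tol between consecutive time steps, mirroring the
        # equilibrium criterion of Section 6.
        for _ in range(max_iter):
            a_new = step(a)
            if np.max(np.abs(a_new - a)) < tol:
                return a_new
            a = a_new
        return a

Within such a framework, the isolation experiments just described amount to choosing what the step callback recomputes competitively: driving cortex directly (all thalamic activations zero) isolates the intracortical mechanism, while setting c_p = 0 isolates the thalamocortical one.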
[Figure 6 here: four panels plotting the activation level of a single cortical element versus Iteration (0-400).]
Figure 6: Simulation of the inhibitory effects of electrophoretically applied glutamate on activation levels of cortical elements. Each graph shows the activation level a_i of a single cortical element over time. In each case element i is initially at the center of an island of activation like that in Figure 4. Application of glutamate to the same (r = 0) or a nearby (r = 1, 2, 3) cortical element is simulated by an external input e_i directly to that cortical element for 150 time units starting at t = 61. During the simulated application of glutamate to the same cortical element i (r = 0, upper left), a_i increases and remains high. However, when application of glutamate to neighboring cortical elements at r = 1, 2, and 3 is simulated (remaining three graphs), a_i decreases and element i appears to be inhibited in spite of the absence of horizontal intracortical inhibitory connections. This effect also occurs at r = 4 (not shown) but is diminished. Particularly striking is the pattern at r = 1 (upper right quadrant): the application of glutamate to a cortical element j immediately adjacent to element i inhibits i (after a brief transient) in spite of the excitatory connections between these two elements. Compare to experimental data in Hess et al. (1975). Parameters as in Figure 4 except δ = 0.25.
It is possible to vary systematically the intensity of peristimulus inhibition in Model C by altering the constants v and q in the connection strengths c_ki (equations 4.1 and 4.2).
[Figure 7 here: plot of Activation Level versus Distance.]
Figure 7: Cross section through cortex activation pattern (solid line) that occurs in response to a broad band of afferent activation applied to thalamus (dotted line). Note the edge enhancement and peristimulus inhibition.

As might be expected from analytic considerations (Benaim and Samuelides 1990), a wide variety of simulations demonstrate that increasing the value of v or decreasing the value of q intensifies peristimulus inhibition. For example, in the simulation results described above, v = 1 and typically q = 0.01. If these global parameters are changed to v = 3 and q = 0.0075 then all peristimulus inhibitory effects are intensified. Stronger inhibition is manifest, for example, as a deeper inhibitory penumbra around a point stimulus when resting activation levels are nonzero, and as more intense and sharply focused islands of activation when resting levels are zero. The intensity of peristimulus inhibition can be varied in this fashion regardless of whether competitive distribution of activation occurs in cortex alone, via thalamocortical connections alone, or both together. Besides varying gain constants (c_p, c_n, c_s) and peristimulus inhibition (v, q), one can vary the activation rule itself to assess further the robustness of the effects described above.

⁸"Quasilinear" rather than "linear" is used for Model CL because connection strengths c_ij are a function of a_i.

When this is done, replacing equation 3.2 with da_i/dt = in_i + c_s a_i, quasilinear⁸ versions of Models
C and I are obtained and are designated CL and IL. Activation levels in Models CL and IL are potentially unbounded. However, if c < 0, where c = c_s + c_p + c_n is the sum of the gain parameters, then total activation for Models CL and IL will be bounded (Reggia and Edwards 1990). This constraint is thus enforced in all simulations with these models. When parameter values within these ranges are used, interelement excitatory and inhibitory relationships derived analytically are the same as those described above for Model C. However, when Models CL and IL have the same c value, c_s is substantially more negative with Model CL than with Model IL. In other words, Model CL requires stronger intraelement or self-inhibition to function well than does Model IL. This feature of Model CL is consistent with the existence of numerous vertical (and therefore intraelement) inhibitory connections in neocortex. Repeating all of the experiments with Model C described above with Model CL produces qualitatively similar results. As gain parameters (c_p, c_s), competition parameters (v, q), resting activation levels, location of competitive distribution of output (intracortical vs. thalamocortical vs. both), and input patterns are altered, simulations with Model CL demonstrate results similar to those seen with Model C. Thus, peristimulus inhibition is relatively robust, and is not dependent on the exact form of the activation rule used. Finally, for each variation of Model C (or CL) a matching variation of Model I (or IL) with the same input patterns serves as a "control." Recall that Model I has the same network and activation rule as Model C except that Model I uses lateral intracortical inhibitory connections rather than competitive distribution of activation. Each version of Model I also always has the same c value, resting activation levels, initial activation levels, and criteria for simulation termination as the version of Model C for which it serves as a control. Further, gain parameters of Model I are set so that its response to a single point thalamic stimulus is as close as possible to the single-point response of the version of Model C to which it corresponds. For example, to create a version of Model I that corresponds to highly competitive versions of Model C (v large, q small), a relatively large negative value of c_n is used. By proceeding in this fashion, it proves possible through a suitable choice of parameters and lateral inhibitory connections to produce a version of Model I that behaves qualitatively similarly to each version of Model C. In other words, in general the results described above with Model C can also be produced by appropriate versions of Model I. Activation patterns seen with Model C generally do not uniquely indicate that competitive distribution is present rather than lateral inhibitory connections. Some apparent exceptions occur in variations of the models with very intense lateral inhibitory effects (v very large, q very small; or c_n very large in magnitude), but these appear with a range of parameter values that are presumably nonphysiologic.
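The quasilinear variant just described replaces the shunting factor (M - a_i) of equation 3.2 with purely linear decay. A minimal sketch of the update and of the boundedness condition quoted above follows (parameter names as in the text; the functions themselves are our illustrative rendering):

    def euler_step_quasilinear(a, in_a, c_s, dt):
        # Models CL and IL: da_i/dt = in_i + c_s a_i, with c_s < 0.
        return a + dt * (in_a + c_s * a)

    def total_activation_bounded(c_s, c_p, c_n):
        # Condition enforced in all CL/IL simulations: the summed gains
        # c = c_s + c_p + c_n must be negative for bounded activation.
        return c_s + c_p + c_n < 0

Because there is no ceiling M in this rule, the negative summed gain is what keeps total activation from growing without bound, which is consistent with the substantially more negative c_s that Model CL requires.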
7 Discussion
This work has demonstrated that competitive distribution of activation can replicate many aspects of peristimulus inhibition seen in neocortex. Inhibitory effects generally attributed solely to lateral inhibitory connections (e.g., Mexican Hat pattern of activation, island formation, edge enhancement) can also be explained by competitive distribution in the absence of such inhibitory connections. Further, the amount of peristimulus inhibition can be adjusted by varying the intensity of the underlying competitive process. These findings are robust with respect to moderate functional and parameter changes. They occur when competitive distribution of activation is used by thalamic elements alone, by cortical elements alone, or by both simultaneously. The latter observation implies that the two parts of the competitive distribution hypothesis must be confirmed or refuted individually. Given these results, how can one determine which possible mechanism, direct/indirect lateral inhibitory connections or competitive distribution, is actually responsible for a specific peristimulus inhibitory effect observed in neocortex? The simulations with Model I revealed that with a suitable selection of parameters and network connectivity, it is usually possible to duplicate qualitatively the activation patterns seen with Model C by using horizontal inhibitory connections. Thus, it currently appears unlikely that one could distinguish between these two mechanisms based solely on patterns of electrical or metabolic activity in neocortex. Their discrimination will apparently depend on obtaining better knowledge of complex intrinsic cortical and thalamocortical circuitry. In this context, it should be noted that morphological and physiological studies currently do not provide conclusive evidence that peristimulus inhibition in cortex is solely due to horizontal intracortical inhibitory circuitry. Careful electron micrographic studies have repeatedly demonstrated that the vast majority of horizontal connections in neocortex are excitatory (Fisken et al. 1973, 1975). While there certainly do exist inhibitory horizontal connections in neocortex, such as those from basket cells, they are apparently relatively sparse and at times their distribution does not match well with the distribution of peristimulus inhibitory effects. Thus, direct inhibitory horizontal connections (Fig. 1a) seem unlikely to explain all neocortical peristimulus inhibitory effects. Further, horizontal intracortical connections mostly terminate on spiny cells that are not immunoreactive for GABA (Gabbott et al. 1987; Kisvarday et al. 1986; LeVay 1988). This suggests that the primary role of intercolumnar horizontal connections is the activation of excitatory (pyramidal, spiny stellate) cells rather than inhibitory cells (Fig. 1b,c). Such a conclusion is complemented by physiological studies that have failed to show lateral excitation of inhibitory neurons in regions of peristimulus inhibition (Hess et al. 1975) and by studies based on cross-correlation analysis that show that direct intercolumnar interactions are predominantly
While significant indirect lateral inhibition has recently been demonstrated in neocortex, it is polysynaptic and has been demonstrated only for longer (approximately 1 to 3 mm) distances (Hirsch and Gilbert 1991). Thus, although the data are currently limited and are subject to different interpretations, this at least raises concerns about accepting the prevailing view that direct/indirect horizontal inhibitory connections are the sole cause of peristimulus inhibitory effects in neocortex. These concerns are particularly striking when contrasted with the convincing case that is readily made for vertical (intracolumnar) inhibitory connections.

Conversely, the same data that are difficult to explain with conventional views of intracortical inhibitory mechanisms are consistent with the competitive distribution hypothesis. Competitive distribution of activation does not involve either direct or indirect lateral inhibitory connections: as shown in this paper, peristimulus inhibition arises in the context of purely excitatory horizontal connections. Further, as noted earlier, analysis of the dynamics of competitive distribution networks implies that they tend to require stronger self-inhibition of elements than do networks using lateral inhibitory connections (Reggia and Edwards 1990). This is consistent with the observation that most intracortical inhibitory connections are vertical and thus intracolumnar.

This leads to the critical question of what one might look for empirically in terms of specific neural circuitry that can produce competitive distribution of activation. Given the gaps in our current knowledge, any answer to this question must be imprecise and speculative. A critical aspect of the competitive distribution hypothesis is a kind of "rich-get-richer" phenomenon. When a cortical element is active, it tends to favor sending activation to neighboring elements that are already at least partially active, while decreasing output to neighboring elements that are inactive. Thus, a central feature of competitive distribution of activation is a logical AND operation: activation tends to spread from an activated element toward a neighboring element if and only if the sending element and the receiving element are both at least partially activated. This implies that if competitive distribution of activation is present intracortically, one would expect to find intrinsic neocortical circuitry that effectively performs a "leaky" AND operation in governing horizontal spread of activation. Current data appear to be inadequate to substantially support or refute this expectation (Douglas and Martin 1990; White 1989), although the recent demonstration that partially depolarized pyramidal cells have strikingly augmented responses to excitatory inputs is consistent with an AND operation (Hirsch and Gilbert 1991).

However, the neural circuitry mediating interactions between thalamus and cortex is better defined. Thalamic nuclei have traditionally been viewed as "relay nuclei" that pass along afferent information largely unmodified to neocortex. However, it is difficult to relate this view of thalamus to the fact that corticothalamic connections (excitatory connections
from cortex back to thalamus) are actually more numerous than afferents to thalamus from the periphery (Sherman and Koch 1990). In this context, it is relevant that Model C implies the existence of topographically matched corticothalamic (cortex to thalamus) connections. In Model C, there are no explicit corticothalamic connections, but the retrograde influences described earlier imply the existence of such connections at the level of individual neurons. To understand this, recall that an element in Model C represents a set of neurons rather than a single neuron, and that a connection in Model C thus does not represent a single synaptic connection between two neurons. A connection in Model C represents a more complex set of structural and functional synaptic relations between the neurons in the two elements it joins. In particular, in deciding how much activation to send to cortical element k, thalamic element j uses a connection strength c_kj that is explicitly a function of cortical element k's activation a_k (see equation 4.2). In reducing Model C's elements and connections to individual neurons and their synaptic connections, this implies that there must be synaptic connections from neurons in cortical element k to those in thalamic element j in order to communicate a_k to the latter. In other words, Model C requires corticothalamic synaptic connections, and competitive distribution of activation is a potential reason why such connections exist in the brain.

Further, careful analysis of the relative positions of corticothalamic and afferent synapses on the dendrites of thalamic relay neurons, and of the nonlinear properties of NMDA receptors, has led to the suggestion that relay neurons in thalamus do serve as AND gates (Koch 1987). In other words, it has been proposed that a thalamic relay cell is activated in proportion to a logical AND operation involving its sensory input and its cortical input. Afferent information is thus routed through the thalamus and reaches the cortex to the extent that cortical activity matches the sensory input to the thalamus. Of course, this suggestion has previously left open the question of why, in a functional sense, corticothalamic connections act in this fashion. In the past, researchers have speculated that this AND operation serves to selectively enhance interesting features in sensory input as an attentional mechanism. What we propose here is an alternate explanation for this thalamic AND operation: that it represents a biophysical manifestation of the competitive distribution hypothesis. In other words, the existence of corticothalamic connections that guide the forward competitive distribution of incoming sensory input via such an AND operation is predicted by the competitive distribution hypothesis.

This argument can be generalized to account for the asymmetric nature of connections between hierarchically organized cortical regions such as those processing visual information (Felleman and Van Essen 1991). These cortical regions have "forward" connections terminating predominantly in layer IV, "backward" connections preferentially avoiding layer IV, and "lateral" connections terminating in all layers. Although not examined in our computational model, if the competitive distribution
hypothesis is extended to include longer range cortex-to-cortex connections like these, then it would explain the existence of backward connections differing from forward connections. Backward connections here are analogous to corticothalamic connections: they are required to guide the forward competitive distribution of activation and are again predicted to perform a logical AND operation. From this perspective, lateral connections are between two cortical regions that both competitively distribute their activation to each other; this explains why they terminate in all cortical layers. While this interpretation is speculative, it does demonstrate that if the competitive distribution hypothesis is valid, it will provide us with a deeper understanding of a number of features of cortical connectivity.

In addition to the AND operation, a second critical aspect of the competitive distribution hypothesis is the "normalization" of an element's outputs (denominator, equation 4.1). We are currently investigating physiologically plausible neural circuitry that might implement such functionality, and we briefly sketch one possible approach to intracortical normalization here as an example. With this approach, each element (column) contains a subpopulation of excitatory neurons that is activated solely by excitatory neurons in neighboring elements and not by thalamocortical afferents. The activity of this subpopulation of neurons thus represents a sum of neighboring elements' activity (denominator, equation 4.1), and is used via intraelement inhibitory neurons to provide normalization of the element's output. In other words, when this normalizing subpopulation is active, an element's activity cannot spread to neighboring elements that are inactive (i.e., output to these less active neighbor elements is decreased). For this to work requires element outputs to pass through appropriate AND operations similar to those described above for thalamocortical interactions. Developing this and other approaches to plausible neural circuitry that can implement competitive distribution of activation is crucial to advancing the theory presented in this paper and is a focus of our ongoing research.
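As a concrete illustration of the rich-get-richer/AND character and of the output normalization just discussed, the sketch below implements a generic competitive output rule of the kind used in Models C and CL. Equations 4.1 and 4.2 are not reproduced in this excerpt, so the exact functional form here, including the roles assigned to the competition parameters v and q and to the gain c_p, is an assumption modeled on the general competitive activation scheme, not a verbatim transcription of the paper's rule.

```python
import numpy as np

# A sketch of competitive distribution of activation, as described in the
# text: a sending element j divides its output among its neighbors k in
# proportion to (a_k + q)^v, so activation preferentially flows toward
# neighbors that are already active ("rich get richer" / leaky AND).
# The exact form of equations 4.1-4.2 is not given in this excerpt;
# this normalized rule is an assumption, not the paper's verbatim rule.

def competitive_outputs(a, j, neighbors, c_p=1.0, v=1.0, q=0.001):
    """Output sent from element j to each of its neighbors."""
    weights = (a[neighbors] + q) ** v
    c = c_p * weights / weights.sum()   # normalization (the "denominator")
    return c * a[j]                     # gated by sender activity: AND-like

a = np.array([0.0, 0.8, 0.5, 0.0, 0.1])   # hypothetical activation levels
nbrs = np.array([0, 2, 3, 4])              # neighbors of element 1
print(competitive_outputs(a, j=1, neighbors=nbrs))
# The active neighbor (element 2) receives most of element 1's output;
# inactive neighbors 0 and 3 receive almost nothing (q sets the leak).
```

Note the two signature properties: the denominator normalizes the sender's total output, and the multiplication by the sender's own activation makes the transfer vanish unless both sender and receiver are at least partially active, the leaky AND behavior described above.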
If one accepts the possibility that competitive distribution of activation may cause some inhibitory effects in neocortex, then a natural question is what advantages this mechanism might have over using lateral/horizontal inhibitory connections exclusively. It has recently been suggested that a critical issue for neocortex is to minimize connections in order to limit its large volume and metabolic requirements (Cherniak 1990; Nelson and Bower 1990). Competitive distribution of activation may have an advantage in this regard: the numerous direct/indirect horizontal inhibitory connections required by traditional explanations of peristimulus inhibition are no longer needed. In addition, although not demonstrated here, some forms of competitive distribution of activation can produce activation patterns that appear difficult if not impossible to reproduce with lateral inhibitory connections. This has been observed in recent nonbiological applications of competitive distribution in artificial intelligence and cognitive science (Reggia et al. 1991). These studies involved a set of elements receiving diverging connections from multiple sets of external elements, each with a different radius of divergence. Competitive distribution provided a cleaner blending of the various-sized peristimulus inhibitory effects this implies than did lateral inhibitory connections. It is interesting to note in this context that each cortical element receives inputs from a variety of sources, with a different radius of divergence for each source (thalamic afferents, intrinsic horizontal cortical collaterals, callosal connections, ipsilateral corticocortical projections, etc.). Finally, the immediate rerouting of the flow of activation that would occur following a small cortical lesion may be important in nervous system response to injury.
Appendix

I. To derive ∂in_i/∂a_k for k ≠ i for Model C, note that by equation 3.1, (7.1)

since c_ij = 0 for j ∈ N_i(2) and j ∈ T_i (recall c_a = c_t = 0 is assumed), and e_i and b_i are constants. A straightforward calculation based on equation 4.1 (noting ∂a_i/∂a_k = 0 since i ≠ k) gives (7.2)

Substituting equation 7.2 into equation 7.1 and some algebra leads to (7.3)

Now consider special cases of equation 7.3 when the distance r between cortical elements i and k is r = 1, r = 2, or r ≥ 3. When r = 1, that is, when k is one of the six cortical elements adjacent to element i (see Fig. 3a), then for j ∈ N_i(1) the value ∂a_j/∂a_k = 0 except for j = k, where ∂a_j/∂a_k = 1. Similarly, for j ∈ N_i(1), ∂a_m/∂a_k = 0 always for m ∈ N_j(1) except for the two elements j ∈ N_ik(1) = N_i(1) ∩ N_k(1) that are explicitly labeled in Figure 3c. For these two elements j ∈ N_ik(1) the value ∂a_m/∂a_k = 0 for m ∈ N_j(1) except when m = k, where ∂a_m/∂a_k = 1. Thus, for r = 1, where k ∈ N_i(1), equation 7.3 becomes (7.4)

which simplifies precisely to equation 5.3 when v = 1.
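Equations 5.3-5.4 and the bodies of equations 7.1-7.5 are not reproduced in this excerpt, but the sign pattern they imply can still be checked numerically. The sketch below uses the same assumed competitive rule as in the earlier sketch, on a 1-D ring rather than the paper's hexagonal lattice (where n_1 = 6), so only the qualitative signs of ∂in_i/∂a_k, not the coefficients, are meant to carry over.

```python
import numpy as np

# Finite-difference check of the Mexican Hat sign pattern derived in this
# appendix, using the assumed competitive rule from the earlier sketch on
# a 1-D ring instead of the paper's hexagonal lattice (n_1 = 6). Only the
# signs of d(in_i)/d(a_k) are expected to carry over, not the values.

def net_input(a, i, c_p=1.0, v=1.0, q=0.01):
    n = len(a)
    total = 0.0
    for j in ((i - 1) % n, (i + 1) % n):     # cortical senders adjacent to i
        nbrs = [(j - 1) % n, (j + 1) % n]    # receivers competing for j's output
        w = (a[nbrs] + q) ** v
        total += a[j] * c_p * w[nbrs.index(i)] / w.sum()
    return total

n, i = 15, 7
a0 = np.full(n, 0.5)                          # uniform activation level a_0
eps = 1e-6
for r in (1, 2, 3):
    k = (i + r) % n
    a = a0.copy()
    a[k] += eps
    deriv = (net_input(a, i) - net_input(a0, i)) / eps
    print(f"r = {r}: d(in_i)/d(a_k) ~ {deriv:+.4f}")
# Expected: positive at r = 1 (adjacent elements excite one another),
# negative at r = 2, and zero at r >= 3: a Mexican Hat profile.
```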
II. Equation 5.4 for r = 2 is derived in a completely analogous fashion to equation 5.3. There is no c_ik term now, since ∂a_j/∂a_k = 0 for all j ∈ N_i(1) (because j ≠ k, since k ∉ N_i(1)), and the set of elements N_ik(1) = N_i(1) ∩ N_k(1) consists of those elements explicitly labeled j in Figure 3d and e, because element k is now at distance r = 2 from element i. This gives (7.5) for r = 2, which simplifies to equation 5.4 when v = 1. For r ≥ 3, N_ik(1) is empty and ∂a_m/∂a_k = 0 always, so by similar reasoning ∂in_i/∂a_k = 0.

III. Now reconsider the case where r = 1 and k is adjacent to i (Fig. 3c). If a_m = 0 for all elements except for a_i and a_k, which are nonzero, then as q → 0 we have c_ik → c_p by equation 4.1 and a_j = 0 for all j ∈ N_ik(1) in equation 5.3. In this case ∂in_i/∂a_k → c_p, as stated in the text. On the other hand, if one of the elements labeled l in Figure 3c is sufficiently highly active, then the sum on the right-hand side of equation 7.4 can become negative, making ∂in_i/∂a_k < 0. Finally, if all cortical elements have the same activation level a_0 > 0, then for v = 1, equation 7.4 simplifies to (7.6)

which for 0 ≤ a_0 ≤ M is always positive (recall n_1 = 6). Thus, if all nodes start at the same initial activation level, adjacent nodes excite one another. This fact, plus the fact that ∂in_i/∂a_k < 0 for r = 2, predicts the occurrence of a Mexican Hat response to a point stimulus in an initially quiescent network.

IV. Now consider thalamic influences on element relationships. Assume that there are no intracortical connections at all (c_p = c_s = 0), so that equation 3.1 becomes (7.7), where e_i and b_i are constant and T_i is the set of thalamic elements that sends connections to cortical element i. Then (7.8), where k ≠ i is a thalamic or cortical element. Suppose first that k is a thalamic element. Then for Model I, ∂c_ij/∂a_k = 0 always because c_ij = w_ij is constant, and ∂a_j/∂a_k = 0 except when j = k, so ∂in_i/∂a_k = w_ik iff k ∈ T_i. For Model C, a straightforward calculation shows that ∂c_ij/∂a_k = 0 also in this case (because ∂a_m/∂a_k = 0 for all cortical elements m, including m = i), and again ∂a_j/∂a_k = 0 except when j = k, so ∂in_i/∂a_k = c_ik iff k ∈ T_i. Thus, for both Models I and C, ∂in_i/∂a_k is positive, and thalamic element k directly excites cortical element i iff k ∈ T_i. This relation is static for Model I and dynamic for Model C.
V. Suppose that k is a cortical element rather than a thalamic element, and consider its relationship to cortical element i ≠ k in the absence of any lateral intracortical connections (equation 7.7). Then for Model I, ∂c_ij/∂a_k = 0 because c_ij is constant, and ∂a_j/∂a_k = 0 always, since j ≠ k (j is a thalamic element, k is a cortical element), so by equation 7.7, ∂in_i/∂a_k = 0. Thus, as expected, cortical element k is neutral with respect to cortical element i in variants of Model I with no lateral intracortical connections.

VI. The situation with a version of Model C having no lateral intracortical connections is different, however, because thalamic elements j distribute their activation competitively according to equation 4.2. Noting that ∂a_i/∂a_k = 0 since i ≠ k, a straightforward calculation from equation 4.2 gives (7.9). Noting that ∂a_m/∂a_k = 0 except for m = k, and that m = k is only possible for j ∈ T_i ∩ T_k, substitution of equation 7.9 into equation 7.8 (where ∂a_j/∂a_k is again zero) and some algebraic manipulation gives (7.10). Since all quantities in equation 7.10 are nonnegative, ∂in_i/∂a_k here is nonpositive. In particular, if there are no thalamic elements that send connections to both cortical elements i and k (i.e., T_i ∩ T_k = ∅), then ∂in_i/∂a_k = 0 and elements i and k are neutral with respect to each other. On the other hand, if some thalamic elements send connections to both cortical elements i and k, then in general ∂in_i/∂a_k < 0 and, unlike Model I, elements i and k directly inhibit each other even in the absence of lateral intracortical connections. Note that equation 7.10 simplifies to equation 5.5 when v = 1 and where T_ik = T_i ∩ T_k.

Acknowledgments

Supported by NINDS Awards NS29414 and NS16332. The authors are with the Department of Computer Science, Department of Neurology, and Institute for Advanced Computer Studies. This paper benefited from discussions with John Donoghue and Chris Cherniak.

References

Benaim, M., and Samuelides, M. 1990. Dynamical properties of neural nets using competitive activation mechanisms. Proc. Int. Joint Conf. Neural Networks, IEEE, San Diego, III, 541-546.
Cherniak, C. 1990. The bounded brain: Toward quantitative neuroanatomy. J. Cog. Neurosci. 2, 58-68.
D'Autrechy, C. L., Reggia, J., Sutton, G., and Goodall, S. 1988. A general purpose simulation environment for developing connectionist models. Simulation 51, 5-19.
Douglas, R., and Martin, K. 1990. Neocortex. In The Synaptic Organization of the Brain, G. Shepherd, ed., pp. 389-438. Oxford University Press, New York.
Felleman, D., and Van Essen, D. 1991. Distributed hierarchical processing in primate cerebral cortex. Cerebral Cortex 1, 1-47.
Fisken, R., Garey, L., and Powell, T. 1973. Patterns of degeneration after intrinsic lesions of the visual cortex of the monkey. Brain Res. 53, 208-213.
Fisken, R., Garey, L., and Powell, T. 1975. The intrinsic, association and commissural connections of area 17 of the visual cortex. Phil. Trans. R. Soc. B 272, 487-536.
Freedman, H. 1980. Deterministic Mathematical Models in Population Ecology. Marcel Dekker, New York.
Gabbott, P., Martin, K., and Whitteridge, D. 1987. Connections between pyramidal neurons in layer 5 of cat visual cortex. J. Comp. Neurol. 259, 364-381.
Gatter, K., Sloper, J., and Powell, T. 1978. An electron microscopic study of the termination of intracortical axons upon Betz cells in area 4 of the monkey. Brain 101, 543-553.
Gilbert, C. 1985. Horizontal integration in the neocortex. Trends Neurosci. 8, 160-165.
Grossberg, S. 1980. Biological competition: Decision rules, pattern formation and oscillations. Proc. Natl. Acad. Sci. U.S.A. 77, 2338-2342.
Hess, R., Negishi, K., and Creutzfeldt, O. 1975. The horizontal spread of intracortical inhibition in the visual cortex. Exp. Brain Res. 22, 415-419.
Hirsch, J., and Gilbert, C. 1991. Synaptic physiology of horizontal connections in the cat's visual cortex. J. Neurosci. 11, 1800-1809.
Hirsch, M. 1989. Convergent activation dynamics in continuous time neural networks. Neural Networks 2, 331-349.
Houser, C., et al. 1984. GABA neurons in the cerebral cortex. In Cerebral Cortex, E. Jones and A. Peters, eds., Vol. 2, pp. 63-89. Plenum, New York.
Jones, E., and Hendry, S. 1984. Basket cells. In Cerebral Cortex, E. Jones and A. Peters, eds., Vol. 1, p. 309. Plenum, New York.
Juliano, S., Hand, P., and Whitsel, B. 1981. Patterns of increased metabolic activity in somatosensory cortex of monkeys. J. Neurophysiol. 46, 1260-1284.
Kisvarday, Z., Martin, K., Freund, T., Magloczky, Z., Whitteridge, D., and Somogyi, P. 1986. Synaptic targets of HRP-filled layer III pyramidal cells in cat striate cortex. Exp. Brain Res. 64, 541-552.
Koch, C. 1987. The action of the corticofugal pathway on sensory thalamic nuclei: A hypothesis. Neuroscience 23, 399-406.
Kohonen, T. 1984. Self-Organization and Associative Memory. Springer-Verlag, Berlin.
LeVay, S. 1988. Patchy intrinsic projections in visual cortex, area 18, of the cat: Morphological and immunocytochemical evidence for excitatory function. J. Comp. Neurol. 269, 265-274.
Lotka, A. 1924. Elements of Physical Biology. Williams & Wilkins, Baltimore.
Mountcastle, V. 1978. An organizing principle for cerebral function. In The Mindful Brain, G. Edelman and V. Mountcastle, eds., pp. 1-50. MIT Press, Cambridge, MA.
Nelson, M., and Bower, J. 1990. Brain maps and parallel computers. Trends Neurosci. 13, 403-408.
Pearson, J., Finkel, L., and Edelman, G. 1987. Plasticity in the organization of adult cerebral cortical maps: Computer simulation based on neuronal group selection. J. Neurosci. 7, 4209-4223.
Reggia, J., and Edwards, M. 1990. Phase transitions in connectionist models having rapidly varying connection strengths. Neural Comp. 2, 523-535.
Reggia, J., Peng, Y., and Bourret, P. 1991. Recent applications of competitive activation mechanisms. In Neural Networks: Advances and Applications, E. Gelenbe, ed., pp. 33-62. North-Holland, Amsterdam.
Sejnowski, T., Koch, C., and Churchland, P. 1988. Computational neuroscience. Science 241, 1299-1306.
Sherman, S., and Koch, C. 1990. Thalamus. In The Synaptic Organization of the Brain, G. Shepherd, ed., pp. 246-278. Oxford University Press, New York.
Somogyi, P., Cowey, A., Halasz, N., and Freund, T. 1981. Vertical organization of neurons accumulating 3H-GABA in visual cortex of monkey. Nature (London) 294, 761-763.
Ts'o, D., Gilbert, C., and Wiesel, T. 1986. Relationships between horizontal interactions and functional architecture in cat striate cortex by cross-correlation analysis. J. Neurosci. 6, 1160-1170.
Von der Malsburg, C. 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.
White, E. 1989. Cortical Circuits. Birkhauser, Boston.
Received 10 June 1991; accepted 28 October 1991.
Communicated by Christof Koch
A Network Simulation of Thalamic Circuit Operations in Selective Attention

David LaBerge
Marc Carter
Vincent Brown
Department of Cognitive Science, University of California, Irvine, CA 92712 USA
The ability of a thalamic circuit to process information selectively from a spatial location was investigated in neural network models. Starting with the known general structure of the thalamic circuit, we considered three variations of the projections from the inhibitory cells of the reticular nucleus onto the cells of the pulvinar nucleus of the dorsal thalamus. The three circuits were modeled as systems of difference equations, and their operations were simulated by computer-based numerical integration. In all three circuits, when input from a target location was slightly larger than the input from neighboring locations, the time evolution of principal (relay) cell outputs showed substantial selective enhancement at the target location compared with neighboring locations. The selective enhancement effect was produced not only on ascending inputs but also on descending cortical inputs. Simulations separating the lateral inhibitory and feedback-enhancement components of the circuits suggested that the feedback-enhancement component substantially magnified the ability of lateral inhibition to produce a target/surround difference.

1 Introduction
Visual processing begins at the layer of photoreceptors in the retina and proceeds through the brain structures of the lateral geniculate nucleus (LGN), striate cortex (V1), and into extrastriate cortex, where modules mediate judgments of objects in the stimulus array (see Fig. 1). Examples of typical judgments computed from the stimulus array are the location of a shape, the direction and velocity of its movement, its color, orientation, and depth, and, if it is a familiar shape, its identity. Typically the visual scene is cluttered with objects, and if a particular property of a target object is to be effectively processed by a judgment module, the information arising from the location of the target object must somehow be selected from the information arising from the locations of the other objects.

Neural Computation 4, 318-331 (1992) © 1992 Massachusetts Institute of Technology
Figure 1: Cortical and thalamic areas presumed to be involved in the identification of an object in a cluttered field (e.g., the "O" in "HOT"). Boxes in the figure are adapted from the Desimone and Ungerleider (1989) schematic of monkey visual cortical areas. PP, posterior parietal; IT, inferior temporal; DLPFC, dorsolateral prefrontal cortex; Pul, pulvinar. The operation of selective enhancement of cortical activity is assumed to be produced by circuitry linking the lateral pulvinar to V4, with afferent inputs to the circuitry from PP.

For example, if a subject is asked to identify the center letter of the three-letter stimulus "HOT," the information from the surrounding H and T locations must be decreased or eliminated to allow identification of the "O" alone instead of the entire word "HOT." The mechanism that selects information at spatial locations has been likened to a "searchlight" or "spotlight" (e.g., Crick 1984; Julesz 1984; Posner 1980; Treisman and Gelade 1980), a "zoom lens" (Eriksen and St. James 1986), or a "filter" (e.g., Broadbent 1958; LaBerge and Brown 1989; Treisman 1964). The brain system or systems that select visual locations could do so in one of three ways: the information flow corresponding to the location of the target object could be enhanced, the information flow corresponding to the locations of the surrounding objects could be diminished, or both. All three methods of selection require that the activity corresponding to the target location be greater than that corresponding to surrounding locations. Because of their central location in the brain, their reciprocal connections with cortex, and the presence of the reticular nucleus with its lateral inhibitory properties, nuclei of the thalamus serve as candidates for a selection mechanism (Ahlsen et al. 1985; Crick 1984; LaBerge and Brown 1989; LaBerge and Buchsbaum 1990; Sherman and Koch 1986; Scheibel
1981; Singer 1977). In particular, recent neurobiological findings in primates and humans suggest that the pulvinar nucleus of the thalamus participates in visual selection of an object in a cluttered field (Desimone et al. 1990; LaBerge and Buchsbaum 1990).

The pulvinar nucleus occupies the posterior 2/5 of the human (dorsal) thalamus and is partitioned into four subnuclei: the anterior, medial, lateral, and inferior nuclei (Jones 1985). In the primate, these structures, particularly the medial (Dick et al. 1991) and lateral (Benevento and Davis 1977; Ungerleider et al. 1983) nuclei, are reciprocally connected to cortical areas in the pathway extending from the occipital lobe to the temporal lobe, and the connections apparently preserve visuotopic mappings. The medial and lateral pulvinar also connect reciprocally with the posterior parietal areas (Dick et al. 1991; Jones 1985), and the medial pulvinar also connects reciprocally with prefrontal cortical areas (Asanuma et al. 1985; Goldman-Rakic and Porrino 1985). Thus the pulvinar is in a particularly advantageous position to mediate influences of anterior cortical areas and posterior parietal areas on sensory processing in the occipital-temporal pathways (see Fig. 1). Aside from these cortical areas, afferent inputs to the pulvinar nuclei arise from the superior colliculus and pretectum, and apparently sparsely, if at all, from the retina. Most of pulvinar processing, then, is exerted on information arising from cortical areas.

In general, thalamic circuits exhibit a topographical structure, given that their inputs project with varying degrees of topographical precision through the principal neurons onto cortical columns (Jones 1985; Steriade et al. 1990). The principal (relay) neurons of these thalamic nuclei could be said to be arranged in columns, since they do not connect with each other. Monosynaptic thalamocortical input to layer 6 neurons is apparently returned to principal thalamic neurons in a thalamocorticothalamic loop (Steriade et al. 1991). Both thalamocortical and corticothalamic axons course through the reticular nucleus, which is a thin sheet of neurons that partially surrounds the thalamus, and collaterals of the axons of the principal and corticothalamic neurons synapse with the cells of the reticular nucleus. The reticular nucleus cells inhibit each other, and they also inhibit principal cells (Jones 1985; Steriade et al. 1990) as well as thalamic interneurons that themselves inhibit principal cells (Ahlsen et al. 1985; Steriade et al. 1991).

Most of the detailed knowledge of thalamic circuitry has been obtained from studies of the LGN and the ventroposterior lateral (VPL) sensory nuclei. Important details of pulvinar circuitry have yet to be described by anatomical labeling studies. One question is whether a reticular nucleus (RN) axon connects with the principal (P) neuron in its own column or with P neurons in neighboring columns, or both, because an RN axon can apparently distribute its boutons widely within a nucleus (Yen et al. 1985). Another question is whether the low percentage (approximately 8%) of RN to interneuron (I) synapses in monkey LGN (Takacs
et al. 1991) carries over to the pulvinar. The answers to these questions about pulvinar circuitry are very likely to affect the way we infer that the pulvinar processes its input signals. In the absence of the required neuroanatomical data, we can construct a set of models that is intended to span the range of anatomical possibilities.
2 Models of the Pulvinar Circuit
A sample of three of the possible combinations of connections from RN neurons to P and I neurons is shown in Figure 2. Each model contains five types of neurons organized in columns: afferents (A) provide ascending input to the circuits from cortical areas (e.g., striate, extrastriate, or prefrontal areas) other than the cortical area to which a given pulvinar circuit projects; principal neurons (P) receive afferent input and in turn project to cortical (C) and reticular nucleus (RN) neurons; cortical neurons return projections to the principal neurons; interneurons (I) directly inhibit principal neurons; reticular nucleus neurons are excited by both principal and cortical axons, but their pattern of recurrent projections to principal neurons and interneurons is currently unknown, and this is the aspect of the circuitry that is varied across the three models.

Model A could be considered the most conservative model, given what is presently known concerning the RN to P connections in thalamic nuclei, because an RN axon here projects unselectively to its own column as well as to neighboring columns. Model B assumes that RN axons project only to neighboring columns, in the manner described by Sherman and Koch (1986). Model C assumes that RN axons project to I cells in their own columns, as well as to P cells in their own columns (Steriade et al. 1991; LaBerge 1990).

The network models of the pulvinar circuit shown in Figure 2 contain two vertically organized columns of cells. Each column is intended to represent a cluster of cells, so that the left column corresponds to information arising from a target location in the visual field, while the right column corresponds to information arising from neighboring locations on each side of the target. We will refer to the left column of a network in Figure 2 as the target pathway and the right column as a flanker pathway. In a corresponding behavioral task, the subject is shown the stimulus HOT and asked to identify the middle letter as the target letter. It is assumed that the information arising from the middle location of the stimulus HOT is represented by signals flowing in the left column of the circuit models, while information arising from the left and right locations of the stimulus is represented by signals flowing in the right column of the circuit models.
Figure 2: Network models representing three variations of pulvinar circuits based on hypothetical projection patterns of reticular nucleus cells to principal cells and interneurons. Abbreviations of neurons are C, cortical; P, principal (relay); RN, reticular nucleus; I, interneuron; A, afferent (from another cortical area or from the superior colliculus). The left column represents the spatial location of the target stimulus and the right column the spatial location of a neighboring (flanker) stimulus. Solid lines represent excitatory connections and dashed lines inhibitory connections.
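The three hypothesized RN projection patterns can be summarized as signed connection masks; the sketch below is a schematic encoding of the wiring in Figure 2, restricted to sign and adjacency. The generalization from the figure's two columns to n columns is an assumption, and the actual weight values are a separate matter (see Table 1 in the Appendix).

```python
import numpy as np

# Schematic encoding of the three hypothesized RN projection patterns of
# Figure 2 as signed connection masks over a row of n columns. Entry
# [s, t] = -1 means the RN cell of column s inhibits the P (or I) cell of
# column t. Generalizing from two columns to n columns is an assumption.

def rn_projections(n, model):
    own = np.eye(n)
    neighbors = np.eye(n, k=1) + np.eye(n, k=-1)
    if model == "A":    # RN -> P in its own column and in neighboring columns
        return -(own + neighbors), np.zeros((n, n))
    if model == "B":    # RN -> P in neighboring columns only (Sherman and Koch 1986)
        return -neighbors, np.zeros((n, n))
    if model == "C":    # RN -> P and RN -> I within its own column only
        return -own, -own
    raise ValueError(model)

rn_to_p, rn_to_i = rn_projections(4, "C")
print(rn_to_p)          # Model C: each RN cell inhibits only its own column's P cell
```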
That Models A and B can increase the difference in activity between target and surround columns can be seen by noting that the circuits contain recurrent lateral inhibitory connections (e.g., see Cornsweet 1970; Hartline 1974). Each column in the circuit mutually inhibits its neighbor, since the principal thalamic and cortical cells synapse on the inhibitory RN cells, which in turn synapse on neighboring principal cells. The within-column feedback from cortical cells to principal cells
serves to enhance the target/surround differences by allowing the level of activity in the target cell to be increased above its initial level (see Grossberg 1980 for a treatment of this kind of circuit), an effect not shown by standard lateral inhibitory mechanisms. Model C functions somewhat differently. Here the RN cells serve to disinhibit principal cells by inhibiting the tonically inhibitory interneurons. This creates in effect a positive feedback loop from the cortical cells to the principal cells. The mutually inhibitory connections between the RN cells generate a lateral inhibitory effect by reducing the disinhibition in neighboring columns.

The main purpose of this study is to provide explicit demonstrations of selective processing by location in these model pulvinar circuits, and to observe whether there are any noteworthy differences in the abilities of the three models to instantiate this function.

3 Simulation Results
The equations underlying the simulation of the operations of the network models and a description of parameter settings are given in the Appendix. The solution of the set of coupled difference equations representing the updating of the state of each neuron unit was obtained by numerical integration.

The performance of the network models under the condition in which the afferent input values for the target and flanker were only slightly different is shown in Figure 3. The trajectories of the principal cells and cortical cells begin with an increase in activity at both target and flanker locations, and initially the increase is marked by oscillations, owing mainly to the interactions arising from reticular nucleus units. Eventually, the activities of the target and flanker locations stabilize at values that differ much more than the initial difference at the afferent inputs. The trajectories of the cortical cells are quite similar to the trajectories of the principal cells. In effect, a slight selective advantage of the target location at the input to the network is augmented considerably and sustained at the output.

The slight initial advantage of the target location over that of the flankers is assumed to be induced by higher order (i.e., prefrontal) processes that influence afferent input to the principal cells of the pulvinar via parietal cortex, because task instructions determine which item of the display is to be identified or judged. This input from parietal cortex is carried by the neurons labeled "A" in Figure 2. The network then increases the effect of this top-down selection by augmenting the difference between the activity at the target and flanker locations, and then projects this larger difference onto the flow of information from occipital to temporal cortex. Evidence that prestriate-driven excitation and inhibition in neighboring cells of the pulvinar can modulate processing in extrastriate
Figure 3: Time evolution of the output of the principal (relay) cells (left side) and cortical cells (right side) corresponding to a target location (solid line) and a flanker location (dashed line) modeled by the three networks. Inputs ascended from the afferent end of the column circuits. The afferent input values for all models were target, 38; flanker, 37; intervening space, 5. (Initial cortical values were set at threshold, i.e., at 10.)
cortex is given by Kato (1990). Strictly speaking, the thalamic network does not itself act as an agent of selection, but rather operates as a gain on information sent to it from other brain regions.
The curves in Figure 3 show that the effects of the networks on afferent input patterns consist both of facilitation of the target location and inhibition of the flanker locations, yielding a net difference that represents the selection outcome. To examine separately the roles of enhancement and inhibition in the network, we ran a simulation in which enhancement was removed (by setting the corticothalamic weight to zero) in order to assess the effect of the lateral inhibitory connections alone. We also ran a simulation with inhibition removed (by setting the RN-to-RN weights to zero or, in the cases of Models A and B, also setting the RN-to-P weights to zero) in order to assess the effect of enhancement alone. When the inputs of the target and flankers differ only slightly, it appears that inhibition without enhancement produces little increase in advantage for the target location relative to that of the flankers. Likewise, enhancement without inhibition produces little, if any, increase in relative activation. Apparently both the inhibition and enhancement components of the network are needed to obtain the substantial activation advantage shown in Figures 3 and 4.

Thus far we have used as inputs to the present networks an activity differential at the ascending afferent end of the circuit. One might expect that an input to the circuit can also originate from the descending cortical end of the circuit, because transcortical axons can increase activity at cortical columns. This case was simulated in the network models by setting the afferent initial activations to zero and assigning initial activation values of 28 and 27 to target and surrounding cortical cells, respectively. The results of these simulations for outputs of principal cells and cortical cells are shown in Figure 4. The output trajectories for cortical cell inputs are similar. It appears that the thalamic circuit of all three models can selectively enhance initial cortical differences between activity at a target location and its surround. A possible implication of this result is that the functions of the thalamic nuclei could include not only the "relaying" of information from afferent inputs to cortex, but also the selective self-enhancement of activity in local cortical columns.

Changes in the gain of the thalamocorticothalamic loop (increasing or decreasing the feedback from the cortical to principal cells by 0.20 activation units) decreased the differences between target and flanker asymptotic outputs for Models A and B when the input ascended from the afferent end of the circuit. When the input descended from the cortical end of the circuit, increasing the gain produced only slight increases in the asymptotic differences for Models A and B, and these decreased at a gain increment of 0.40 units; decreasing the gain by 0.20 units decreased the asymptotic differences for these models. For both ascending and descending inputs of Model C, increasing the gain by 0.10 units decreased the asymptotic difference, and decreasing the gain by 0.20 units also decreased the asymptotic difference.
Figure 4: Time evolution of the output of the principal (relay) cells (left side) and cortical cells (right side) corresponding to a target location (solid line) and a flanker location (dashed line) modeled by the three networks. Inputs descended from the cortical end of the column circuits. The cortical input values for all models were target, 28; flanker, 27; intervening space, 27. (Afferent input values were set to zero.)
4 Conclusion

Three different models of a pulvinar circuit showed very similar selective enhancement of the information originating from a target object
surrounded by other objects, suggesting that the pulvinar could enhance information flow at a specified spatial location. The high rate of gain exerted on the initially small input differences between target and surround locations apparently required the participation of corticothalamic feedback along with the lateral inhibition component(s) of the circuit. The relative enhancement of a target location with respect to its surround was also produced when the sole input to the circuit descended from the cortical cells of a column. This finding suggests that thalamic nuclei other than the pulvinar and sensory nuclei could selectively enhance initially small increments in cortical areas produced by transcortical connections.
Appendix

The activation of the ith unit at time t + 1 is expressed as

A_i(t + 1) = u_i + a_i A_i(t) + Σ_j w_ji V_j(t),    A_i(0) = u_i

and the output of the ith unit at time t + 1 is expressed as

V_i(t + 1) = G_i[A_i(t + 1)],

where

A_i(t) is the activation of the ith unit at time t;
u_i is the base activation of the ith unit;
a_i is the rate of decay of activation of the ith unit;
w_ji is the weight of the connection from unit j to unit i;
V_i(t) is the output of the ith unit at time t;
G_i[ ] is the exponential threshold function of unit i, defined as

G_i[x] = M_i{1 − exp[−k(x − B_i)]}, when x > B_i,
       = 0, otherwise;

M_i is the maximum output of the ith unit;
k is the slope of the exponential function;
B_i is the threshold of the ith unit.
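A direct transcription of these update equations for a small toy network is sketched below. The weights, base activations, and network size are hypothetical placeholders rather than the Table 1 values; only the update rule and the exponential threshold function follow the definitions above.

```python
import numpy as np

# Sketch of the update equations above for a toy network. The weights
# and unit parameters here are hypothetical placeholders, not the
# Table 1 values; only the functional form follows the text.

def G(x, M=100.0, k=0.0128, B=10.0):
    """Exponential threshold output function G_i[x]."""
    return np.where(x > B, M * (1.0 - np.exp(-k * (x - B))), 0.0)

def step(A, V, u, decay, W):
    """One synchronous update: A_i(t+1) = u_i + a_i A_i(t) + sum_j w_ji V_j(t)."""
    A_next = u + decay * A + W.T @ V      # W[j, i] = w_ji, weight from unit j to i
    return A_next, G(A_next)

rng = np.random.default_rng(0)
n = 5
u = np.full(n, 10.0)                      # base activations (hypothetical)
decay = np.full(n, 0.05)                  # decay rates, as quoted below for C and P units
W = rng.normal(0.0, 0.3, size=(n, n))     # hypothetical weights w_ji
A = u.copy()                              # A_i(0) = u_i
V = G(A)
for t in range(50):
    A, V = step(A, V, u, decay, W)
print(np.round(V, 2))                     # outputs after 50 updates
```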
Since RN neurons mutually inhibit each other both through dendrodendritic and axosomatic connections, we make an additional assumption concerning their interactions. Specifically, we assume that for columns h and k,

V_k(t + 1) = V_h(t) − V_k(t), when V_h > V_k,
           = 0, otherwise.
This equation states that the input to the RN unit in column h considers the output of the RN unit in column k only when column k's RN output is greater than h's. In effect, this assumes a winner-take-all relationship between the RN units, with the update of the RN unit having the lower activation based on the difference between the outputs of the two units. The weights from RN units to neighboring RN units and principal cells decreased exponentially with the distance between their columns.

The values of the base activations u_i and the connection weights w_ji are given in Table 1. The activation decay rates a_i are set at 0.05 for cortical and principal units and 0.0 for the others. The maximum output M_i was set at 100 for each unit, and the threshold B_i was set at 10 for all units. The exponential slope k was 0.0128. The parameters for Model C were estimated with the assistance of a hill-climbing program based on an integration of the differences between target and flanker firing rates attained in the shortest time. For purposes of comparison, certain parameter values were held constant across the three models. Therefore, for Models A and B, further adjustment of the parameters by hand was employed to approximate an optimization of the criterion. However, it should be noted that the obtained parameter values are probably not optimal with respect to the size of the target/flanker output difference and the speed with which this difference is achieved.

In the present simulations, the size and location of an object is represented by five adjacent columns, with three columns separating a pair of objects in the case of ascending afferent input, and no separation between columns in the case of descending cortical input. Thus a stimulus display of three horizontally positioned objects had a width corresponding to 21 columns in the network space. The graphic output of the network shown in Figures 3 and 4 is the average output activity of the principal (or cortical) cells of the columns corresponding to the targets and flankers. The total number of columns in each simulation was 23 (including an empty column at each end), and the number of cells in each column was 7 (one each of A, P, and C, and two each of RN and I), for a total of 161 units. Simulations of networks in which each column contained one or four RN (and I) units produced patterns of outputs that were similar to the patterns obtained here.

Acknowledgment

The preparation of this paper was supported by ONR Grant N00014-88-K-0088 to the first author.
Table 1: Parameters for the Three Network Simulations.

Starting activations u_i (columns: C, RN, P, I, A):

            Models A and B                Model C
Target      10.0 10.0 10.0 15.0 38.0      10.0 30.0 10.0 47.0 38.0
Flanker     10.0 10.0 10.0 15.0 37.0      10.0 30.0 10.0 47.0 37.0
Space       10.0 10.0 10.0 15.0  5.0      10.0 30.0 10.0 47.0  5.0

Connection weights w_ji (rows: Cortical, RN, Principal, Inter, Afferent; columns: C, RN, P, I, A, separately for Models A, B, and C): [the individual matrix entries are not legible in this copy and are not reproduced here]

Between-column weights         Model A    Model B    Model C
RN-RN                          -1.500     -1.500     -1.500
RN-P                           -0.700     -0.300      0.000
Decay (column-to-column)        0.985      0.985      0.985
Radius of RN influence          8          8          8
References

Ahlsen, G., Lindstrom, S., and Lo, F. S. 1985. Interaction between inhibitory pathways to principal cells in the lateral geniculate nucleus of the cat. Exp. Brain Res. 58, 134-143.
Asanuma, C., Andersen, R. A., and Cowan, W. M. 1985. The thalamic relations of the caudal inferior parietal lobule and the lateral prefrontal cortex in
monkeys: Divergent cortical projections from cell clusters in the medial pulvinar nucleus. J. Comp. Neurol. 241, 357-381.
Benevento, L. A., and Davis, B. 1977. Topographical projections of the prestriate cortex to the pulvinar nuclei in the macaque monkey: An autoradiographic study. Exp. Brain Res. 30, 405-424.
Broadbent, D. E. 1958. Perception and Communication. Pergamon Press, London.
Cornsweet, T. N. 1970. Visual Perception. Academic Press, New York.
Crick, F. 1984. The function of the thalamic reticular complex: The searchlight hypothesis. Proc. Natl. Acad. Sci. U.S.A. 81, 4586-4590.
Desimone, R., and Ungerleider, L. 1989. Neural mechanisms of visual processing in monkeys. In Handbook of Neuropsychology, Vol. 2, F. Boller and J. Grafman, eds., pp. 267-299. Elsevier, Amsterdam.
Desimone, R., Wessinger, M., Thomas, L., and Schneider, W. 1990. Attentional control of visual perception: Cortical and subcortical mechanisms. Cold Spring Harbor Symp. Quant. Biol. 55, 963-971.
Dick, A., Kaska, A., and Creutzfeldt, O. D. 1991. Topographical and topological organization of the thalamocortical projection to the striate and prestriate cortex in the marmoset. Exp. Brain Res. 84, 233-253.
Eriksen, C. W., and St. James, J. D. 1986. Visual attention within and around the field of focal attention: A zoom lens model. Percept. Psychophys. 40, 225-240.
Goldman-Rakic, P. S., and Porrino, L. J. 1985. The primate mediodorsal (MD) nucleus and its projections to the frontal lobe. J. Comp. Neurol. 242, 535-560.
Grossberg, S. 1980. How does a brain build a cognitive code? Psychol. Rev. 87, 1-51.
Hartline, H. K. 1974. In Studies on Excitation and Inhibition in the Retina, F. Ratliff, ed. Rockefeller University Press, New York.
Jones, E. G. 1985. The Thalamus. Plenum Press, New York.
Julesz, B. 1984. Toward an axiomatic theory of preattentive vision. In Dynamic Aspects of Neocortical Function, G. M. Edelman, W. E. Gall, and W. M. Cowan, eds., pp. 585-610. Wiley, New York.
Kato, N. 1990. Cortico-thalamo-cortical projection between visual cortices. Brain Res. 509, 150-152.
LaBerge, D. 1990. Thalamic and cortical mechanisms of attention suggested by recent positron emission tomographic experiments. J. Cog. Neurosci. 2, 358-372.
LaBerge, D., and Brown, V. 1989. Theory of attentional operations in shape identification. Psychol. Rev. 96, 101-124.
LaBerge, D., Brown, V., and Carter, M. 1990. Selective cortical enhancement effects produced by a thalamic circuit model based on current neuroanatomical findings. Soc. Neurosci. Abstr. 16, 579.
LaBerge, D., and Buchsbaum, M. S. 1990. Positron emission tomographic measurements of pulvinar activity during an attention task. J. Neurosci. 10, 613-619.
Posner, M. I. 1980. Orienting of attention: The VIIth Sir Frederic Bartlett Lecture. Q. J. Exp. Psychol. 32, 3-25.
Scheibel, A. B. 1981. The problem of selective attention: A possible structural
substrate. In Brain Mechanisms and Perceptual Awareness, O. Pompeiano and C. Ajmone Marsan, eds. Raven Press, New York.
Sherman, S. M., and Koch, C. 1986. The control of retinogeniculate transmission in the mammalian lateral geniculate nucleus. Exp. Brain Res. 63, 1-20.
Singer, W. 1977. Control of thalamic transmission by corticofugal and ascending reticular pathways in the visual system. Physiol. Rev. 57, 386-420.
Steriade, M., Jones, E. G., and Llinas, R. R. 1990. Thalamic Oscillations and Signaling. Wiley, New York.
Steriade, M., Pare, D., Hu, B., and Deschenes, M. 1991. The visual thalamocortical system and its modulation by the brain stem core. In Progress in Sensory Physiology 10, H. Autrum, D. Ottoson, E. R. Perl, R. F. Schmidt, H. Shimazu, and W. D. Willis, eds. Springer-Verlag, Berlin.
Takacs, J., Hamori, J., and Silakov, V. 1991. GABA-containing neuronal processes in normal and cortically deafferented dorsal lateral geniculate nucleus of the cat: An immunogold and quantitative EM study. Exp. Brain Res. 83, 562-574.
Treisman, A. M. 1964. Selective attention in man. Br. Med. Bull. 20, 649-742.
Treisman, A., and Gelade, G. 1980. A feature integration theory of attention. Cog. Psychol. 12, 97-136.
Ungerleider, L. G., Galkin, T. W., and Mishkin, M. 1983. Visuotopic organization of projections from striate cortex to inferior and lateral pulvinar in rhesus monkey. J. Comp. Neurol. 217, 137-157.
Yen, C. T., Conley, M., Hendry, S. H. C., and Jones, E. G. 1985. The morphology of physiologically identified GABAergic neurons in the somatic sensory part of the thalamic reticular nucleus in the cat. J. Neurosci. 5, 2254-2268.
Received 6 June 1991; accepted 7 October 1991.
Communicated by A. B. Bonds
Generation of Direction Selectivity by Isotropic Intracortical Connections

Florentin Worgotter
Institut für Physiologie, Ruhr-Universität Bochum, W-4630 Bochum, Germany
Ernst Niebur
Christof Koch
Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125 USA
To what extent do the mechanisms generating different receptive field properties of neurons depend on each other? We investigated this question theoretically within the context of orientation and direction tuning of simple cells in the mammalian visual cortex. In our model a cortical cell of the "simple" type receives its orientation tuning by afferent convergence of aligned receptive fields of the lateral geniculate nucleus (Hubel and Wiesel 1962). We sharpen this orientation bias by postulating a special type of radially symmetric long-range lateral inhibition called circular inhibition. Surprisingly, this isotropic mechanism leads to the emergence of a strong bias for the direction of motion of a bar. We show that this directional anisotropy is neither caused by the probabilistic nature of the connections nor is it a consequence of the specific columnar structure chosen, but that it is an inherent feature of the architecture of visual cortex.

All of the response properties of cortical cells can in principle be explained with models that postulate a high degree of connection specificity. Establishing highly specific connections requires, however, large amounts of information, which might be more than what can be determined genetically or learned during development. Thus, it becomes of interest to investigate the extent to which unspecific mechanisms, which can be established with minimal information requirements, underlie the observed receptive field properties.

Neural Computation 4, 332-340 (1992) © 1992 Massachusetts Institute of Technology

We constructed over the last years a detailed model of the connectivity in a small patch in layer IV of cat visual cortex in order to investigate the mechanisms underlying orientation selectivity. The model of a 5° × 5° patch of the primary visual pathway of cat includes a total of more than 16,000 cells in the ON and OFF subsystems of retina and LGN and simple
cells in layer IV in area 17. The retina is stimulated with moving light bars. Cells are modeled as improved integrate-and-fire neurons. Our single-cell model takes into account absolute and relative refractory periods and the after-hyperpolarization following a spike, but it stops short of the fine details that can be found, for instance, in the Hodgkin-Huxley model. Realistic convergence and divergence numbers between cell populations are implemented using more than 2,000,000 synapses (Orban 1984). Each cortical cell receives input from a field of (on average) 5 × 13 LGN cells. This convergence of LGN receptive fields leads to an initial orientation bias with a cortical receptive field elongation of 1.7. The preferred orientation of the cortical receptive fields is changed continuously across the modeled patch with a periodicity (hypercolumn width λ) of 1°. Receptive fields are modeled with realistic scatter and jitter (Albus 1975), and axonal delays were set to realistic values (Hoffmann et al. 1972). All parameters are subject to statistical fluctuations (Wehmeier et al. 1989; Worgotter and Koch 1991).

Figure 1B shows the average orientation and direction tuning curve of 55 randomly chosen cells from this model obtained by using only a Hubel and Wiesel type wiring scheme (Hubel and Wiesel 1962; see Chapman et al. for some recent evidence supporting this model). Orientation tuning is conferred on these cells by virtue of this specific type of afferent wiring from LGN to cortex. Another possibility to achieve an orientation bias would be to use elongated LGN receptive fields (Vidyasagar 1984; Vidyasagar and Urbas 1982; Shou and Leventhal 1989; Soodak et al. 1987). The actual mechanism underlying the orientation bias does not, however, influence the central result of this study and, therefore, we chose to use the rather well-known Hubel and Wiesel connection scheme.

Figure 1C shows the orientation tuning after "circular inhibition" (Niebur and Worgotter 1990) is superimposed onto the afferent bias. In this scheme, cortical interneurons preferentially inhibit cells located a certain distance away (Fig. 1A). Worgotter et al. (1991) showed analytically that at a radius of half a hypercolumn (λ/2), circular inhibition acts as a weakly tuned cross-orientation inhibition (Benevento et al. 1972) for most cells, because λ/2 is the smallest radius at which cells with orthogonal orientation preferences contribute inhibition. This effect, however, is not strong, because inhibition from many orientations is included. Circular inhibition not only results in an increase in orientation tuning, but also in a marked increase in direction tuning, which attains¹ D = 23% (DI = 44%). Some degree of direction tuning is expected for any connection scheme with probabilistic variability, since the net synaptic input to a cell from a given direction will, in general, not be identical to that from the opposite direction.

¹The direction index DI is based on the difference of cell responses only along the axis of preferred direction (Orban 1984). A more reliable measure for direction selectivity that takes into account the cell response for all directions is D, the first moment of the statistical distribution of responses (Swindale et al. 1987; Worgotter et al. 1990). We will always give values of D and DI in the text.
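For reference, the two measures can be computed from a set of responses as sketched below. The precise definitions are those of Orban (1984) and Swindale et al. (1987), which are not reproduced here; the conventions used in this sketch (DI from the response difference along the preferred axis, D as the normalized first circular moment of the response distribution) are standard ones and should be treated as assumptions, as are the example response values.

```python
import numpy as np

# Sketch of the two direction-selectivity measures from the footnote.
# The exact definitions are in Orban (1984) and Swindale et al. (1987);
# the conventions below are common ones and are assumptions here.

def direction_measures(directions_deg, responses):
    d = np.radians(np.asarray(directions_deg, dtype=float))
    r = np.asarray(responses, dtype=float)
    pref = int(np.argmax(r))
    ang = (np.asarray(directions_deg) - directions_deg[pref] - 180.0) % 360.0
    ang = np.minimum(ang, 360.0 - ang)
    null = int(np.argmin(ang))                          # direction opposite the preferred
    DI = (r[pref] - r[null]) / r[pref]                  # difference along the preferred axis
    D = np.abs(np.sum(r * np.exp(1j * d))) / r.sum()    # first circular moment
    return D, DI

dirs = np.arange(0, 360, 45)
resp = np.array([20, 14, 10, 9, 8, 9, 10, 14], float)   # hypothetical tuning curve
D, DI = direction_measures(dirs, resp)
print(f"D = {100 * D:.0f}%, DI = {100 * DI:.0f}%")      # DI exceeds D, as in the text
```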
Figure 1: Direction tuning arising from an isotropic connection scheme. (A) Part of the connection diagram for cortical cells in the detailed cortex model (Wehmeier et al. 1989; Worgotter and Koch 1991). Part of the column structure is depicted on top. Orientation columns run parallel to the y-axis in the simulated patch. Each cortical cell receives inhibitory input from other cortical cells located at a distance of about half a hypercolumn ("circular inhibition"). On average 100 cortical cells converge onto each target cell. (B) Average direction tuning curve obtained with LGN convergence but without intracortical inhibition. Peak impulse rates are plotted as a function of the stimulus angle after rotation of all polar plots to a common preferred orientation. Without intracortical inhibition, D = 8% (DI = 16%). (C) Average tuning curve after including circular inhibition. Although circular inhibition consists of isotropic connections, it leads to a clear direction bias (D = 23%, DI = 44%).

In our simulation, "random" inhibition arising from about 200 cells randomly selected within a certain distance from the target cell leads to a direction tuning of D = 12% (DI = 28%) (see Worgotter and Koch 1991). These probabilistic effects are, however, small compared to the strong increase in D observed with circular inhibition. How does this directional anisotropy arise from isotropic intracortical connections? Obviously, the cortical column structure, which is
predetermined by the afferent orientation bias, must supply the necessary anisotropy. A better understanding of this effect is desirable than that obtained from the detailed simulation, in which the phenomenon is obscured by the large number of parameters. Therefore, we developed a different model, stripping the cells of all properties except their orientation tuning. A cell is then completely described by equation 1.1 below. In Figure 2A, a part of the simplified cortical column structure is shown. Short lines indicate the preferred orientation φ of cortical cells, described by φ(x) = π(x/λ), where x is the spatial coordinate along the horizontal axis. The target cell in the center receives inhibitory input from all cells located on a circle with a radius r of half a hypercolumn, r = λ/2. A cell with preferred orientation φ is assumed to respond with the activity function

C(γ − φ) = C0 + C2 cos(2γ − 2φ)    (1.1)
to a stimulus bar with angle γ across the receptive field of the cell, where C0 and C2 are constants satisfying C0 ≥ C2 > 0. Equation 1.1 is a simple model for an orientation selective cell with preferred orientation φ. The response of this cell is maximal (C0 + C2) if the stimulus is aligned with the preferred orientation (i.e., γ = φ), and minimal (C0 − C2) if the orientation of the stimulus is orthogonal to the preferred orientation of the cell (i.e., γ = φ ± π/2). The half-width-at-half-height of the orientation tuning is adjusted to match the average orientation tuning of cortical simple cells (Orban 1984). A retinotopic projection is assumed to exist over distances of > 200 μm, as has been observed experimentally (Albus 1975), but not necessarily over much shorter distances. We restrict ourselves to the treatment of inhibitory intracortical connections, because the generation of an anisotropic directional effect from isotropic connections does not depend on the sign of the interaction. We assume that the inhibitory input to the target cell arises from the cell on the circle that is excited first by the moving stimulus bar. We will further assume that, due to a fixed axonal propagation delay, this inhibition arrives within a small time window together with the activation of the center cell. Thereby we neglect effects induced by different stimulus velocities. None of the above assumptions influences the qualitative observation of the generation of a direction bias as such, but only its actual strength (e.g., see Fig. 2D for the radius dependency). The inhibition elicited by a certain stimulus is given by the activity function defined above. A graphic representation of this inhibition is provided by the length of the cross section that cuts through the polar plot along the stimulus orientation. The black bars inside the polar plots in Figure 2B show the amount of inhibition for eight example stimuli with different orientations. Plotting the strength of inhibition against the direction of the stimulus motion reveals the tuning of inhibition of a particular cell (Fig. 2C). The diagram shows that motion with a downward component elicits significantly less inhibition than motion
Figure 2: Structural model for the explanation of the direction bias generated by circular inhibition. (A) Part of the simplified cortical column structure. Short lines indicate the preferred orientation φ of cortical cells. The target cell in the center receives inhibitory input from all cells located on a circle with a radius of half a hypercolumn (r = λ/2). A cell with preferred orientation φ is assumed to respond with the activity function C(γ − φ) = C0 + C2 cos(2γ − 2φ) to a stimulus bar with angle γ across the receptive field of the cell (elliptical tuning curves with "wasp waist"; half-width-at-half-height orientation tuning ≈ 20°). (B) Stimuli with eight different orientations elicit inhibition defined as the cross section of the moving bar through the receptive field center (black bars inside the tuning curves). (C) Tuning of inhibition for a cell with horizontal preferred orientation, obtained by plotting the lengths of the cross sections against the angle of stimulus motion. Inset: Tuning of inhibition for the same cell including jitter in the orientation columns and using a 0.5° wide bar. (D) Radius dependency of the population average direction bias D̄, computed as the average of the D values for cells of all preferred orientations. Curves are shown for straight (curve 1) and realistically bent (curve 2) columns, as well as for a patch of an observed cortical column structure (Swindale et al. 1987; curve 3). The equation for the generation of bent columns is given elsewhere (Niebur et al. 1991; Worgotter et al. 1991). The curves are qualitatively similar for r < λ/2 and give values for D̄(r) of up to 38% (DI ≈ 57%).
with an upward component, resulting in an average D value of 36%. It is worth emphasizing that the anisotropy of the tuning curve (Fig. 2C) is not due to probabilistic effects, since this model is noise free. The amount of inhibition arising exactly along the axis of preferred motion is identical for both directions. This singular situation occurs only in the oversimplified straight column structure. Including realistic jitter of ±10° into the columns (Albus 1975) and using a 0.5° wide bar instead of an infinitely narrow bar increases the direction tuning and removes the singular point (see inset in Fig. 2C). The direction bias of individual cells depends on their location in the column structure and can disappear completely at certain locations. Therefore, the average direction bias of the whole cell population, D̄, gives a better estimate of the overall effect. In Figure 2D the radius for circular inhibition is varied and the radius dependency of D̄ is shown for more realistic columnar structures. Curve 1 was obtained from the straight column structure discussed so far; curve 2 from a realistically curved column structure (Niebur et al. 1991; Worgotter et al. 1991); and curve 3 belongs to an observed cortical patch in area 18 of the cat, described by Swindale et al. (1987). Up to a radius of half a hypercolumn (r = λ/2) all curves yield a similar average direction bias. This justifies the analysis of the simple straight column structure. Note that the average direction bias (D̄ = 26% or DI = 47%) is close to the average direction bias observed in simple cells (D ≈ 28% or DI ≈ 50%; Orban 1984; Berman et al. 1987). Although exact quantitative statements are not possible with our simple model, the strength of this effect cannot be neglected.

We introduced circular inhibition as an unspecific connection scheme allowing us to investigate the limits of specificity necessary to achieve functional order. Circular inhibition sharpens orientation tuning (Worgotter and Koch 1991; Worgotter et al. 1991); but, similar to cross-orientation inhibition, circular inhibition cannot generate orientation tuning without an initial orientation bias. Direction tuning, however, arises without any preexisting direction bias. It seems plausible that such a readily available direction bias is used in development and strengthened by enhancing those mechanisms that add to its performance. Such an interpretation is supported by the experimental finding that during development only a small number of cells are initially direction selective. Many more cells are only weakly biased early in development, and their direction selectivity increases only after the development of orientation tuning (Grigonis et al. 1988). This would support our notion that direction tuning follows the emergence of orientation selectivity. Our results do not contradict findings that direction tuning can be elicited over short distances (Ganz and Felder 1984). One should, however, note that such short-range connections have to be specifically designed for the generation of direction tuning, whereas we have shown that even unspecific long-range connections will result in a direction bias.
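The structural model above is simple enough to be reproduced in a few lines. The following sketch assumes, as described, straight columns with φ(x) = π(x/λ), the activity function of equation 1.1, and the simplification that the inhibiting cell is the one on the r = λ/2 circle first reached by a bar moving perpendicular to its own orientation; the constants C0 and C2 are illustrative.

```python
import numpy as np

C0, C2 = 1.0, 0.8            # illustrative constants with C0 >= C2 > 0
lam = 1.0                    # hypercolumn width (lambda)
r = lam / 2                  # radius of circular inhibition

def phi(x):
    """Preferred orientation in straight columns: phi(x) = pi * x / lambda."""
    return np.pi * x / lam

def inhibition(d):
    """Inhibition received by the target cell at the origin for a bar moving
    in direction d. The bar is oriented perpendicular to its motion, and the
    inhibiting cell is taken to be the one on the circle that the bar reaches
    first, located at -r * (cos d, sin d)."""
    gamma = (d + np.pi / 2) % np.pi          # orientation of the stimulus bar
    x_first = -r * np.cos(d)                 # x-coordinate of first-excited cell
    return C0 + C2 * np.cos(2 * gamma - 2 * phi(x_first))

# opposite directions generally receive unequal inhibition (direction bias):
print(inhibition(np.deg2rad(45)), inhibition(np.deg2rad(225)))
# but exactly along the preferred motion axis the two directions are equal,
# reproducing the singular point of the straight column structure:
print(inhibition(np.pi / 2), inhibition(3 * np.pi / 2))
```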
Very little connection information (radius and annulus diameter) is necessary to establish circular inhibition, which can be considered as a particular type of long-range lateral inhibition. Thus, circular inhibition represents a rather unspecific and broadly tuned connection scheme that appears easy to implement developmentally. In fact, weak tuning of inhibition compatible with this scheme has been observed recently (Bonds 1989). However, the rigidly defined circular inhibition used in this study is a rather unrealistic connection scheme. Does the generation of a directional anisotropy depend on the details of this scheme? In the following, we show that this is not the case and that our result is valid for a much larger class of systems than the column structures and connection schemes we have considered thus far. We will show that directional anisotropy can be expected to occur in all realistic column structures and all realistic long-range connection schemes.

One key element necessary for the emergence of anisotropy is the inhomogeneity of the orientation column structure, by which the preferred orientations change systematically along the cortex. As a result, any two cells with a sufficiently large distance between them are likely to have different orientation tuning. Consequently, connections originating from those cells (i.e., long-range connections²) will have a different impact on the target cell and thus produce directional tuning. Without further assumptions on the distribution of the connections, this effect will be reasonably strong only if the number of converging cells is small. The reason is that even if the number of connections is large and they are spread out at random, the contributions will average out to a large extent and only a small anisotropy will result. This is avoided if the connections are not distributed at random, but "clustered" in sufficiently small areas (Gilbert and Wiesel 1983), each one encompassing less than the full range of orientations; a toy illustration of this averaging-out effect is sketched below. In summary, locally correlated activity with a drop-off of the correlation at larger distances, and clustered connections with a sufficiently large distance between the clusters, is all that is needed to generate direction tuning from any realistic (i.e., containing large cell numbers) long-range connection scheme. Note that the term "cluster" should by no means be interpreted in a narrow sense; for example, in the case of circular inhibition there is basically only one "cluster," namely the circle of connections around the target cell. This shows that circular inhibition is, in a sense, a "worst-case scenario," because it is an isotropic connection pattern. Furthermore, the direction tuning arising from the interaction of clustered synaptic connections and an underlying inhomogeneous column structure is not limited to orientation selective cells. This could provide an explanation for the direction selectivity that is observed in response to nonoriented stimuli, like random dots (Hammond 1978).

²Obviously, "long-range connection" refers to a connection whose length is at least comparable to the distances over which the preferred orientation changes appreciably, i.e., a hypercolumn. This may be taken as the definition of the term "long-range."
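The following toy comparison (our construction, not part of the original analysis) makes the averaging-out argument concrete. It reuses the activity function of equation 1.1 and straight columns, and assumes crudely that every source cell upstream of the motion contributes inhibition; only the qualitative contrast between random and clustered connection origins is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
C0, C2, lam = 1.0, 0.8, 1.0
r = lam / 2

def net_inhibition(psis, d):
    """Total inhibition for motion direction d from source cells at circle
    angles psis, counting only cells upstream of the motion (i.e., reached
    by the stimulus bar before the target cell)."""
    x = r * np.cos(psis)                     # x positions fix the orientations
    upstream = np.cos(psis - d) < 0          # projection onto motion is negative
    gamma = (d + np.pi / 2) % np.pi
    resp = C0 + C2 * np.cos(2 * gamma - 2 * np.pi * x / lam)
    return resp[upstream].sum()

def anisotropy(psis):
    """Largest per-cell inhibition difference between opposite directions."""
    ds = np.linspace(0, np.pi, 90, endpoint=False)
    return max(abs(net_inhibition(psis, d) - net_inhibition(psis, d + np.pi))
               for d in ds) / len(psis)

M = 200
print("random:   ", anisotropy(rng.uniform(0, 2 * np.pi, M)))  # averages out
print("clustered:", anisotropy(rng.uniform(0, np.pi / 4, M)))  # strong bias
```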
Thus, it seems that the columnar organization of the cortex, together with basically all long-range connection schemes (in the sense defined above), leads inevitably to the emergence of direction selectivity. The search for specific intracortical mechanisms for direction tuning might, therefore, be in vain.
Acknowledgments
We thank Dr. J. Knierim for a critical reading of the manuscript. F. W. is supported by the Deutsche Forschungsgemeinschaft, E. N. by the Swiss National Science Foundation, Grant 8220-2594. C. K. acknowledges the support of the Air Force Office of Scientific Research, an NSF Presidential Young Investigator Award, and the James S. McDonnell Foundation.
References

Albus, K. 1975. Exp. Brain Res. 24, 181-202.
Benevento, L. A., Creutzfeldt, O. D., and Kuhnt, U. 1972. Nature (London) 238, 124-126.
Berman, N. E. J., Wilkes, M. E., and Payne, B. R. 1987. J. Neurophysiol. 58, 676-699.
Bonds, A. B. 1989. Vis. Neurosci. 2, 41-55.
Chapman, B., Zahs, K. R., and Stryker, M. P. 1991. J. Neurosci. 11(5), 1347-1358.
Ganz, L., and Felder, R. 1984. J. Neurophysiol. 51, 294-324.
Gilbert, C. D., and Wiesel, T. N. 1983. J. Neurosci. 3, 1116-1133.
Grigonis, A. M., Zingaro, G. J., and Murphy, E. H. 1988. Dev. Brain Res. 40, 315-318.
Hammond, P. 1978. J. Physiol. (London) 285, 479-491.
Hoffmann, K. P., Stone, J., and Sherman, S. M. 1972. J. Neurophysiol. 35, 518-531.
Hubel, D. H., and Wiesel, T. N. 1962. J. Physiol. (London) 160, 106-154.
Niebur, E., and Worgotter, F. 1990. In Proceedings of the IJCNN'90, San Diego, pp. II-367–II-372. IEEE Press, Piscataway, NJ.
Niebur, E., Worgotter, F., and Koch, C. 1991. In Proceedings of the 3rd Midwestern Conference on Neural Networks, S. Sayegh, ed., pp. 67-74. Purdue Research Foundation Press, W. Lafayette.
Orban, G. A. 1984. Neuronal Operations in the Visual Cortex. Springer, Berlin.
Shou, T., and Leventhal, A. G. 1989. J. Neurosci. 9, 4287-4302.
Soodak, R. E., Shapley, R. M., and Kaplan, E. 1987. J. Neurophysiol. 58, 267-275.
Swindale, N. V., Matsubara, J. A., and Cynader, M. S. 1987. J. Neurosci. 7, 1414-1427.
Vidyasagar, T. R. 1984. Exp. Brain Res. 55, 192-195.
Vidyasagar, T. R., and Urbas, J. V. 1982. Exp. Brain Res. 46, 157-169.
Wehmeier, U., Dong, D., Koch, C., and Van Essen, D. 1989. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds., pp. 335-359. MIT Press, Cambridge.
Worgotter, F., Grundel, O., and Eysel, U. T. 1990. Eur. J. Neurosci. 2, 928-941.
Worgotter, F., and Koch, C. 1991. J. Neurosci. 11(7), 1959-1979.
Worgotter, F., Niebur, E., and Koch, C. 1991. J. Neurophysiol. 66(2), 444-459.
Received 3 May 1991; accepted 7 November 1991.
Communicated by Christof Koch
Binding Hierarchies: A Basis for Dynamic Perceptual Grouping

Erik D. Lumer
Bernardo A. Huberman
Department of Applied Physics, Stanford University, Stanford, CA 94305 USA, and Dynamics of Computation Group, Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304 USA

Since it has been suggested that the brain binds its fragmentary representations of perceptual events via phase-locking of stimulated neural oscillators, it is important to determine how extended synchronization can occur in a clustered organization of cells possessing a distribution of firing rates. To answer that question, we establish the basic conditions for the existence of a binding mechanism based on synchronized oscillations. In addition, we present a simple hierarchical architecture of feedback units that not only induces robust phase-locking within and segregation between perceptual groups, but also serves as a generic binding machine.

1 Feature Integration via Phase-locking

When we look at a familiar scene, we are able to rapidly identify many objects as such. In the process, our attention can be selectively captured by visual groupings, even when their constituent elements are not spatially contiguous (Duncan 1984). Somehow, the distributed and overlapping patterns of activity in our brain are translated into distinct percepts. Finding the mechanism by which the fragmentary representation of unitary objects is "glued" together is known as the binding problem. Several tentative answers have been proposed. Von der Malsburg (1981) hypothesized that temporally coinciding stimuli were represented by the correlated firing of neurons. It was further suggested that the synchronization of oscillatory activities in the nervous system is the mechanism by which local visual features are linked into coherent percepts (Gray et al. 1989; Eckhorn et al. 1988) and visual events are registered (Damasio 1989; Crick and Koch 1990). Some ground has been given to these conjectures by recent reports of experiments indicating synchronized oscillatory responses in the cat primary visual cortex to stimuli such as elongated or colinear moving
Neural Computation 4, 341-355 (1992)
@ 1992 Massachusetts Institute of Technology
bars (Gray et al. 1989; Eckhorn et al. 1988). Phase-locked oscillations in the range 35-80 Hz were found locally within a vertical cortical column and also in hypercolumns with nonoverlapping receptive fields separated by as much as 7 mm.

Several models have been proposed in order to study the emergence of collective synchronization from the interactions of neural oscillators (Kammen et al. 1989; Eckhorn et al. 1989; Atiya and Baldi 1989; Sompolinsky et al. 1991). A common limitation of these models, however, lies in the lack of heterogeneity of their embedding network. This is important in light of the fact that cells within cortical layers are divided into groups that form columns (Mountcastle 1977) that are in turn organized into smaller aggregates. As a result, long-range connections are sparse and assembled into a branching structure called a cluster (Gilbert and Wiesel 1983). The preferred connectivity between cells coding similar features imposes yet another kind of ordering of the interaction strengths among oscillators. Accordingly, one could expect the very specific cortical organization to be reflected in the dynamics of the firing cells. As a matter of fact, experiments seem to indicate that the probability for the occurrence of phase locking is a decreasing function of both the spatial separation between recording sites and the difference between the preferred stimulus orientations of the probed cells (Gray 1989). Purely psychological studies of perceptual grouping based on measures of subjective similarity also reveal a complex representation of sets of stimuli, which is often hierarchical (Shepard 1980).

In spite of the complex connectivity between cortical cells, a state of synchrony among neurons receptive to similar features does seem to arise in a number of studies. The ensuing potential of this state to serve as a representation of global percepts has triggered the development of several artificial systems, such as systems of coupled oscillators for sound discrimination (Von der Malsburg and Schneider 1986), texture discrimination (Pabst et al. 1989; Baldi and Meir 1990), shape recognition (Hummel and Biederman 1990), and computer models of perceptual grouping in the brain (Sporns et al. 1989, 1991). Although very relevant to the binding problem, these simulations do not contain analytical predictions concerning the synchronization properties of large assemblies of oscillators organized in clusters. The need for better modeling and further theoretical analysis of the complex structures and mechanisms that induce collective rhythms among cortical cells motivates the present paper.

We present here the first results derived from a study of hierarchical architectures of coupled neural oscillators. Such structures seem particularly relevant to support possible neurobiological binding mechanisms (Damasio 1989; Crick and Koch 1990). A useful abstraction consists in dividing binding architectures into two groups that enable distant phase-locking: one via long-range hori-
zontal connections between cells in remote clusters, and the other as a result of common feedback projections (Kammen et al. 1989). In the next two sections, we study the first group via a model of coupled oscillators having a hierarchy of coupling strengths and an associated ultrametric measure that quantifies the distance between any two oscillators in the hierarchy. The coupling strengths are set to decrease with ultrametric distance. We assume that all the cells in the population are excited by some external stimulus, so that they oscillate with intrinsic frequencies that are distributed around some fixed average value. We then analyze the collective states of excitation in the population in order to establish the necessary conditions for the working of binding mechanisms based on synchronized oscillations. As a result, we obtain analytical bounds on the minimum connectivity between cells, their various coupling strengths, and the distribution of their intrinsic frequencies. In Section 4, we propose and investigate numerically an alternative binding structure that consists of a simple hierarchy of feedback units. When compared to networks with direct reciprocal links, this architecture offers the advantage of minimal connectivity and fast induced phase-locking, even in the case when background noise is unavoidable. The analysis of Section 3, although performed for networks of cells with reciprocal links, provides strong guidance for designing computer simulations of the feedback hierarchy. These simulations demonstrate how the network can group and/or segregate stimuli in a fashion consistent with the Gestalt laws of spatial continuity and proximity. Given that hierarchical structures are ubiquitous in visual processing systems, we think of this architecture as a generic model of a binding machine.

2 Model of Clustered Dynamics
We model a large population of weakly coupled neural oscillators by their phase dynamics along their limit cycles. That this can be done for pulsed neurons has been recently shown by Kuramoto (1990). Similar models have been derived and extensively used in modeling physical (Kuramoto 1984) and biological systems (Winfree 1980; Cohen et al. 1982). In the simplest case, the evolution of a set of N oscillators coupled via reciprocal interactions is given by

dθj/dt = ωj + Σ_(i=1..N) Kij h(θi − θj)    (2.1)

where θj is the phase of the jth oscillator, ωj its intrinsic frequency, and Kij a reciprocal coupling strength between oscillators i and j. The interaction between i and j is weighted by an odd periodic function, h(·), of their phase difference. In the present context, the variable ωj represents the intrinsic frequency of a neural oscillator, j, in the absence of coupling. We assume that the intrinsic frequencies are distributed over some range
around a fixed average value. The choice of a constant average value across the population, independent to a large extent of the specific stimulus configuration, is consistent with experimental observations (Gray 1989). On the other hand, it seems appropriate to represent the spread of frequencies around the average value, a spread that is likely to arise from discrepancies in the physical structure of the activated neuronal groups or from local variations in the signals generated by the external stimulus (Eckhorn et al. 1988). We model this distribution with a density function f(ω).

The matrix of coupling strengths, K, defines the architecture of connections between oscillators. Two extreme connectivity patterns have already been thoroughly studied. The first, a fully interconnected population of infinite size, was shown by Kuramoto to exhibit a phase transition to a state of collective synchrony at a critical value Kc of the uniform coupling strength. The other case, corresponding to a regular lattice of oscillators with nearest neighbor coupling, has been studied by Daido (1988), who clarified the dependence of the onset of global synchronism on the lattice dimension. A more relevant architecture for visual processing, and the one we consider here, is given by a hierarchy in which oscillators in a cluster couple at a given strength while weaker links convey the intercluster interactions. Figure 1 gives a visual representation of such a structure, in which the oscillators are the leaves of a regular tree. The ultrametric distance,¹ lij, between any pair (i, j) of oscillators is defined as the level of their nearest common ancestor node in the tree. Notice that the tree is not meant to represent the real network of coupling connections but only the way their strength, or equivalently their density, varies with distance. In fact, the coupling strength between the oscillators i and j is set to be a decreasing function of lij,

Kij = K d(lij)    (2.2)

As we will show, the profile of coupling strengths, d(l), defines the range of macroscopic behaviors exhibited as the control parameter of the dynamics K is swept across its critical values. The other relevant parameters of our model are the branching ratio b of the embedding tree, its depth L, and the asymptotic behavior of the distribution of frequencies f(ω) in the limit of large deviations from the average ω̄. Without loss of generality, one may write in this case
f(ω) ∝ |ω − ω̄|^(−α−1)    (2.3)

with 0 < α ≤ 2. The choice of an asymptotic distribution that is a power law is motivated by the observation that spatial variations of an external stimulus are likely to induce a steady dispersion of the firing rates among

¹The concept of ultrametricity as a measure of perceptual similarity was introduced by Shepard (1980).
the responding neuronal groups. Since these variations can be arbitrary, it is reasonable to consider more general classes of distributions than the uniform or gaussian ones that are usually used to model the effect of random noise. This can be done by varying the parameter α across its allowed range. Indeed, the analytical results presented below apply not only to populations with a broad distribution, such as the Lorentzian (for α = 1), but also to ones possessing a finite variance. In the latter case, one can set α to 2 as far as the study of the collective dynamics of the system is concerned (see Daido 1988 for a rigorous treatment of this point).

Figure 1: Hierarchy of coupling strengths. The oscillators are at the leaves of the tree. The ultrametric distance between two oscillators is given by the level of their nearest common ancestor in the tree. For instance, the distance between i and j in this figure is equal to 2. Their ancestor node also defines the cluster of oscillators branching out from it (in bold in the figure).
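The model of equations 2.1 and 2.2 is straightforward to integrate numerically. The sketch below is a minimal illustration (much smaller than the 1024-oscillator simulations reported in Section 3, and with illustrative parameter values): oscillators at the leaves of a binary tree, an exponential coupling profile, Euler integration, and the frequency-locking criterion described later in the caption of Figure 2.

```python
import numpy as np

rng = np.random.default_rng(1)
L, b = 6, 2                        # tree depth and branching ratio
N = b ** L                         # 64 oscillators at the leaves
a = 2.0                            # profile factor of equation 3.5
K, sigma = 10.0, 0.1               # coupling constant, frequency spread
dt, steps = 0.05, 4000

def lij(i, j):
    """Ultrametric distance: level of the nearest common ancestor of i, j."""
    l = 0
    while i != j:
        i, j, l = i // b, j // b, l + 1
    return l

d = lambda l: (a - 1) / ((b - 1) * b ** (l - 1) * a ** l)   # exponential profile
Kmat = np.array([[K * d(lij(i, j)) if i != j else 0.0
                  for j in range(N)] for i in range(N)])

omega = 1.0 + sigma * rng.standard_normal(N)    # intrinsic frequencies
theta = rng.uniform(0.0, 1.0, N)
h = lambda x: np.sin(2 * np.pi * x) / (2 * np.pi)   # coupling function (Fig. 2)

freq_acc = np.zeros(N)
for _ in range(steps):
    # equation 2.1: dtheta_j/dt = omega_j + sum_i K_ij h(theta_i - theta_j)
    dtheta = omega + (Kmat * h(theta[None, :] - theta[:, None])).sum(axis=1)
    theta += dt * dtheta
    freq_acc += dtheta

Omega = freq_acc / steps                        # time-averaged frequencies
T = steps * dt                                  # total integration time
locked = np.abs(Omega[:, None] - Omega[None, :]) < 1 / (2 * T)
print("largest frequency-locked fraction:", locked.sum(axis=1).max() / N)
```

Sweeping K in this sketch reproduces qualitatively the single transition or cascade of transitions discussed in the next section, depending on a and b.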
3 Phase Transition Cascades into Collective Synchrony
The potential for grouping and segmentation of a population of oscillators with hierarchical interactions depends on the nature of the macroscopic synchronized states. We outline in this section the theoretical analysis of our model and complement it with numerical simulations. More detailed proofs of our results, involving a fairly long renormalization-group analysis, are reported elsewhere (Lumer and Huberman 1991).

First, we have to express that the total coupling to any oscillator j remains bounded as the population size grows, that is,

Σ_(i=1..N) d(lij) = 1    (3.1)

This relation causes the profile function d(l) to fall off quickly with increasing ultrametric distance. The left-hand side of equation 3.1 can be factored into contributions from the (b − 1)b^(l−1) oscillators at the same ultrametric distance l from j. The previous equation then reads

Σ_(l=1..L) (b − 1) b^(l−1) d(l) = 1,    L >> 1    (3.2)
where the sum extends over all the levels in the coupling tree. Two typical profiles are found to satisfy the above constraint. The first one, which we call linear, is defined as

d(l) = 1 / [L (b − 1) b^(l−1)]    (3.3)
and amounts to an equal contribution to the coupling strength from the clusters at all scales in the population. In other words, the fall-off of the coupling strength with ultrametric distance is exactly balanced by the increase in the number of interacting oscillators at that distance. Using a renormalization technique similar to the one developed by Daido (1988) for lattice structures, we have shown that the evolution of the average phase within a large cluster is driven by interaction terms with other clusters proportional to their average phase difference. Thus our results apply to more complex hierarchies than just regular ones, since the net interaction between any two large clusters of oscillators is independent of the detailed connections between them. We can further demonstrate that if the distribution of intrinsic frequencies drops fast enough away from the average value (i.e., if α > 1), a single phase transition to a collective state of synchrony among all the oscillators will arise as soon as the coupling constant K exceeds a threshold value Kc (3.4), which implies that a larger dispersion in frequency moves the transition threshold to a higher value. A similar relation exists between the critical coupling strength and the width of dynamic noise added to a uniform firing rate among the population of oscillators (Sompolinsky et al. 1991).

A collective state of synchrony is defined by the presence of a macroscopic number of cells (i.e., of order N) oscillating with a common frequency dθ/dt. Notice that because the frequency-locked neurons have different intrinsic firing rates, they cannot be strictly in phase. The phase dispersion among these cells is an increasing function of Δω/K, with Δω measuring the intrinsic frequency spread. Thus, if the coupling constant is large compared to the dispersion of firing rates, collective synchrony will also imply near phase-locking among the oscillators. Therefore, frequency-locking in assemblies of oscillators is both a precursor and a strong indicator of phase-locked states that are relevant to the binding problem.

Since oscillators with a linear profile of couplings exhibit a single nonlocal state of synchrony, they cannot separate percepts according to the Gestalt cues of spatial proximity. Of greater interest is a structure that allows for the perceptual grouping of elements close enough in spatial or feature space via synchronization, while sharply segregating more distant ones. We therefore consider a second class of profiles, which we shall call exponential, defined as
d(l) = [1/((b − 1) b^(l−1))] × [(a − 1)/a^l],    a > 1    (3.5)
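Both profiles satisfy the normalization constraint 3.2 by construction, which is easy to check numerically (branching ratio and depth below are arbitrary choices):

```python
b, L = 3, 30
linear = lambda l: 1.0 / (L * (b - 1) * b ** (l - 1))                 # eq. 3.3
expo = lambda l, a=2.0: (a - 1) / ((b - 1) * b ** (l - 1) * a ** l)   # eq. 3.5
for d in (linear, expo):
    print(sum((b - 1) * b ** (l - 1) * d(l) for l in range(1, L + 1)))
# the linear profile sums to exactly 1; the exponential one to 1 - a**(-L)
```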
Notice that within this profile class a unit increment of the ultrametric separation corresponds to a reduction by a geometric factor a of the total coupling with oscillators at the new distance. Such profiles will lead to a discrete transition to nonlocal synchrony in the limit of a population of infinite size if the following inequalities are satisfied:

b ≥ a^(α/(α−1)),    α > 1    (3.6)
where the parameter α measures the frequency dispersion according to equation 2.3. Provided that the asymptotic distribution of frequencies drops faster than a power law with exponent α = 1, the above relations define a critical connectivity bc = a^(α/(α−1)) in the network of interactions that increases with the profile factor a and the dispersion of frequencies. As soon as the branching ratio drops below bc, the collective dynamics in the cell assembly changes dramatically. No global synchrony is possible in a system of infinite size; instead, increasing K drives the system through an infinite cascade of abrupt transitions to synchrony within clusters of increasingly larger dimensions, which reach global synchrony only in the unphysical limit of an infinite coupling constant. Also, a larger profile factor a results in broader windows of coupling constant values for which the maximum size of synchronized clusters remains unchanged.
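For concreteness, relation 3.6 can be evaluated directly for the parameter combinations simulated in Figure 2 below (treating gaussian frequency tails as α = 2, per the discussion in Section 2):

```python
def regime(b, a, alpha):
    """Classify the collective behavior using relation 3.6: a single
    transition to global synchrony requires alpha > 1 and
    b >= b_c = a**(alpha/(alpha-1))."""
    if alpha <= 1:
        return "no global synchrony at any finite K"
    b_c = a ** (alpha / (alpha - 1))
    return ("single transition (b >= b_c = %.2f)" % b_c if b >= b_c
            else "cascade of clusterwise transitions (b < b_c = %.2f)" % b_c)

# the four simulations of Figure 2:
for b, a, alpha in [(2, 1.2, 2), (2, 2.0, 2), (4, 2.0, 2), (2, 1.2, 1)]:
    print("b=%d, a=%.1f, alpha=%d ->" % (b, a, alpha), regime(b, a, alpha))
```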
It is possible to show rigorously that the system described by equation 2.1 exhibits these distinct regimes, depending on whether b is smaller or larger than bc. The proof consists in renormalizing equation 2.1 so as to derive the equations of evolution for aggregated cluster oscillators at a given level l in the hierarchy. These equations are formally identical to equation 2.1, with the individual phases and frequencies replaced by cluster averages and the interaction terms properly rescaled. When b is smaller than bc, the rescaling factor is such that the interaction terms converge to zero as the level l considered increases, a fact that indicates the absence of a single state of global synchrony for any finite value of the coupling constant (Daido 1988; Lumer and Huberman 1991). In the case of a branching ratio larger than bc, the positive value of an order parameter for K larger than a critical threshold similar to that in equation 3.4 is the signature of a single phase transition to global synchrony.

In place of a formal proof, we can obtain a qualitative understanding of these properties when the dispersion of firing rates is gaussian. Suppose that K is driven through the threshold value Kc at which clusters, internally synchronized up to level l − 1, lock their frequencies. Then each newly synchronized cluster at level l becomes equivalent to a single giant phase oscillator as far as its interactions with other such assemblies are concerned. Its overall intrinsic frequency is distributed according to a gaussian, which is now narrower by a factor √b than the distribution of aggregated frequencies one level below. At the same time, the effective coupling between clusters at level l is reduced by the profile factor a. If

√b > a    (3.7)

equation 3.4 shows that clusters at level l should also synchronize as K exceeds Kc. Since the same argument applies recursively to any two consecutive levels larger than l, one deduces that a single transition will lead to the bulk synchronization of all clusters larger than a certain size. (Notice that condition 3.7 is exactly the one predicted by equation 3.6 when α is set to 2.) If the effective coupling strength drops faster than the frequency dispersion, that is, a > √b, successive threshold values of K that correspond to a unit increment of the maximum ultrametric distance separating two synchronized oscillators are spaced by a multiplicative factor, which should converge to a/√b as one moves up in the coupling tree.

We have simulated a discrete version of the system 2.1 for a population of 1024 oscillators. Equation 2.1 is integrated forward in time for 9000 iterations, with a temporal step dt = 0.05. We refer the reader to the caption of Figure 2 for a description of the simulations. To summarize our findings, we notice that when the coupling profile drops exponentially, a cascade of phase transitions is observed, and one can control the maximum size of synchronized clusters by an appropriate choice of the coupling constant. Notice, however, that the value of this coupling
constant increases roughly as the square root of the size of the synchronized population, a feature that is neither desirable nor implemented in real systems. Nevertheless, the exponential profile of coupling does provide a critical shape against which others can easily be compared. Consider, for example, a coupling profile that decreases more slowly than exponentially below a certain distance d and faster beyond, possibly becoming null at finite range. Our analysis indicates that in this case collective synchrony might be reached in clusters of size d or smaller for finite values of the coupling strengths, but with clusters of size d or larger remaining mutually incoherent. Thus the properties of perceptual grouping, that is, the strong linking of related activities in spatial or feature space, along with the sharp desynchronization of separated groups, can be achieved. We point out that elaborate computer-based models of cortical systems implement a configuration of synchronizing connections whose density must fall off exponentially with distance in order to reproduce experimental observations (Sporns et al. 1991). Our study might provide a theoretical justification for this empirical design.
4 A Binding Architecture

Under suitable conditions, a population of oscillators with reciprocal coupling links is capable of grouping neighboring elements and segregating distant ones. However, this architecture presents several limitations in its grouping abilities. First, it was shown in a special case that synchronization in such assemblies is a very slow process (Kammen et al. 1989). Our numerical experiments confirm this point. Also, when the intrinsic firing rates suffer from a steady dispersion, the interactions have to be further increased for the synchronization (i.e., frequency-locking) to occur with a small phase spread. This entails a very high degree of connectivity among the cells.

Kammen et al. (1989) recently demonstrated that special feedback units called comparators are very effective in inducing fast phase-locking between the neural oscillators they connect to, in a fashion immune to high levels of external noise. In a simple comparator model, cells project to a unit that couples them via a common feedback of their average phase θ̄. Their dynamics is thus given by

dθj/dt = ωj + K h(θ̄ − θj)    (4.1)

Notice that equation 4.1 does not contain any spatial dimension. Therefore, the comparator will group similar stimuli regardless of their geometrical arrangement or separation, in contradiction with known Gestalt laws.

We propose here an alternative architecture to the one presented in the last section that combines the advantages of the comparator model with the segregation capabilities of the networks discussed earlier. To do so,
Figure 2: Phase transitions to synchrony. The largest fraction of frequency-locked cells, Ns/N, vs. the coupling constant in a population of 1024 oscillators is shown. Averages of the frequencies dθi/dt, denoted Ωi, were computed to divide the population into synchronized clusters, where |Ωj − Ωi| < 1/(2T) was used as the criterion of frequency locking between i and j, following Daido and others; T designates the integration time. (a) A single transition to global order. The parameters are chosen to satisfy equation 3.6. Specifically, the profile function is exponential with a = 1.2, the distribution of frequencies gaussian (with variance σ = 1.2), and the tree binary (b = 2). The coupling function has the form h(Δθ) = sin(2πΔθ)/2π. (b) The profile is exponential with a = 2 and the same parameters as above. A cascade of transitions to synchrony up to a certain ultrametric distance is observed, as indicated by the heights of the plateaus, which equal fractions that are powers of 2. Notice the different scale along the K-axis. (c) A discrete transition is recovered by doubling the branching ratio of the tree (b = 4). The profile factor is still 2, which leads to the equality of both sides in relation 3.6. (d) The profile factor is set to its value in (a), but the frequency distribution here is Lorentzian (i.e., α = 1). The simulation in this case seems to indicate the absence of global synchrony for any finite value of K. Indeed, although about 80% of the population is synchronized for K = 3, we verified that only 90% of the cells were synchronized for K = 27. This behavior contrasts with that in simulations (a–c), where 100% synchronization is obtained shortly after more than half of the cells have locked their frequencies. Further simulations with larger populations are needed to validate the theoretical predictions in this case, which are derived in the limit of an infinite number of coupled cells.
we embed the feedback units in a tree in such a way that the coupling strength between successive levels decreases exponentially. Such a structure of minimal connectivity seems particularly relevant to support possible neurobiological binding mechanisms (Damasio 1989; Crick and Koch 1990). We constructed a hierarchy of modified comparators by assuming that each node at any level receives as input the phases from its descendant nodes, computes their average value, and increments it with the coupling term fed back by its ancestor. The resulting value represents the node's updated phase, placed at its output connections. It is propagated up to its ancestor unit and fed back to its own children. We made the additional assumption that coupling is effective only between excited cells, that is, the ones receiving detectable activities at their inputs. This is the case, for instance, if the coupling links between real neurons (as opposed to the reduced phase oscillators) are modulatory (Eckhorn et al. 1989).

A discrete dynamics of the system can be expressed as follows. The phases of the neural oscillators are updated according to

θj(t + δt) = θj(t) + δt × {ωj + T(δθj/δt) K h[θa(j)(t) − θj(t)]}    (4.2)

where θa(j) is the average phase of j's ancestor feedback unit. The threshold of activity T is defined by

T(z) = 1 if z > w, 0 otherwise    (4.3)

where δθj/δt represents j's instantaneous frequency.
The phases of the units at levels l between 1 and L − 1 evolve according to

θj(t + δt) = (1/Nc(j)) Σ_k θk(t) + δt K d(l) h[θa(j)(t) − θj(t)]    (4.4)

The sum on the right side of equation 4.4 extends over the Nc(j) active children of j, and d(l) gives the profile decrement at level l. Finally, the updated activity of the root unit, which has no ancestor, is simply the average of the phases of its active children:

θroot(t + δt) = (1/Nc(root)) Σ_k θk(t)    (4.5)

As for models of networks with horizontal connections, the proposed hierarchy of feedback units possesses a number of free parameters, in particular its profile function d(l). The analysis of the previous section suggests that exponential profiles of coupling might again play a unique role in defining the range of macroscopic behaviors exhibited as the control
Figure 3: A cascade of phase transitions to phase-locking of progressively larger clusters is observed in a 16 × 16 array as the coupling constant is increased. Successive values of K are 1, 15, 25, and 40. All the oscillators are excited. The instantaneous phase of each oscillator is plotted as a function of its position in the array. The coupling profile is exponential, with a decrement factor of 2 between successive levels in the hierarchy. The coupling function is sinusoidal. The results are obtained after only 50 iterations of the system dynamics. The distribution of firing rates is gaussian with variance σ = 0.12, and external noise is added. The range of fluctuations of the noise is equal to the variance of the frequency distribution.
parameter K is driven across its critical values. Simulations confirm this conjecture. Figure 3a–d illustrates the clustered activities of oscillators arranged in a 16 × 16 array at the bottom of the coupling tree. The two-dimensional tree has a branching ratio of 4. A cascade of transitions to collective phase-locking is observed when the coupling strengths drop fast enough (i.e., exponentially, with a geometric factor of 2) with ultrametric distance. The effect of external noise on the short-range phase coherence is canceled by the averaging process in the comparators. Notice also the fast synchronization of clusters, achieved after only a few iterations of the binding mechanism. Figure 4 demonstrates the progressive loss of synchrony between the representations of two localized stimuli presented to a 32 × 32 array of model neurons, as they drift apart. The profile of coupling strengths is again exponential.
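A minimal sketch of the feedback hierarchy follows, based on our reading of equations 4.2–4.5 as reconstructed above. It assumes all leaf cells are excited (so the gating T is identically 1), a per-level geometric attenuation of the feedback coupling, and a particular update order; all sizes and constants are illustrative rather than those of the reported simulations.

```python
import numpy as np

rng = np.random.default_rng(2)
b, L = 4, 2                          # branching ratio and depth: 16 leaves
N = b ** L
K, a = 25.0, 2.0                     # coupling constant, per-level decrement
dt, steps = 0.1, 50
h = lambda x: np.sin(2 * np.pi * x) / (2 * np.pi)

omega = 1.0 + 0.12 * rng.standard_normal(N)   # leaf intrinsic frequencies
theta = rng.uniform(0.0, 1.0, N)              # leaf phases
units = {l: np.zeros(b ** (L - l)) for l in range(1, L + 1)}  # feedback units

for _ in range(steps):
    # leaves (eq. 4.2), with every cell excited so T = 1:
    anc = np.repeat(units[1], b)              # phase of each leaf's parent unit
    theta = theta + dt * (omega + K * h(anc - theta))
    # feedback units: average of children plus attenuated ancestor feedback
    child = theta
    for l in range(1, L + 1):
        avg = child.reshape(-1, b).mean(axis=1)
        if l < L:                             # internal units (eq. 4.4)
            anc_l = np.repeat(units[l + 1], b)
            units[l] = avg + dt * (K / a ** l) * h(anc_l - avg)
        else:                                 # root (eq. 4.5): average only
            units[l] = avg
        child = units[l]

print(np.round(theta % 1.0, 2))   # leaf phases cluster after a few iterations
```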
Figure 4: Perceptual grouping and segmentation. Two stimuli are presented to a 32 × 32 array of model neurons. Each stimulus covers a 4 × 4 array of cells. The coupling constant K is set to a value of 30. The other parameters are as in the previous figure. Notice the progressive loss of correlation between the excited regions as the stimuli are moved farther apart.
5 Conclusion

Recent experiments seem to indicate the presence of extended coherence within assemblies of cells coding similar features in visual scenes. As a possible explanation of these observations, several authors have conjectured that the brain binds fragmentary representations of perceptual events via synchronization of oscillating groups of neurons. If such is the case, it is important to determine how extended synchronization can occur in a clustered organization of cells oscillating with a distribution of firing rates. To answer that question we established the basic conditions for the feasibility of a binding mechanism based on synchronized oscillations. Constraints were placed on the connectivity between cells, their various coupling strengths, and the distribution of their intrinsic firing rates. We showed that the coupling strength has to drop fast enough with distance to avoid bulk synchronization among all the oscillators. On the other hand, a large dispersion of firing rates precludes the grouping of percepts via neural synchronization for finite values of the coupling strengths. We also designed an architecture that could be regarded as a direct model of neurological mechanisms akin to those described qualitatively by Damasio. Such a structure can be used as an artificial perception network, thus providing both a possibly useful machine and an experimental tool for the study of perceptual binding mechanisms.
We are currently exploring two possible extensions of the model. The first one imposes adaptive coupling strengths in the hierarchy, so that perceptual grouping and segregation can be learned dynamically. The second allows for the feedback units at various levels to remember the patterns of activities received from their children nodes as well as the corresponding coupling strengths after convergence. This should enable us to study how perceptual memories can be distributed at various scales throughout the hierarchy and retrieved at a later time.
Acknowledgments

We thank a referee for constructive remarks. This work was partially supported by the U.S. Office of Naval Research, Contract No. N00014-82-0699.
References

Atiya, A., and Baldi, P. 1989. Oscillations and synchronizations in neural networks: An exploration of the labeling hypothesis. Int. J. Neural Syst. 1(2), 103-124.
Baldi, P., and Meir, R. 1990. Computing with arrays of coupled oscillators: An application to preattentive texture discrimination. Neural Comp. 2(4), 459-471.
Cohen, A. H., Holmes, P. J., and Rand, R. H. 1982. The nature of the coupling between segmental oscillators of the lamprey spinal generator for locomotion: A mathematical model. J. Math. Biol. 13, 345-369.
Crick, F. H. C., and Koch, C. 1990. Towards a neurobiological theory of consciousness. Sem. Neurosci. 2, 263-275.
Daido, H. 1988. Lower critical dimension for population of oscillators with randomly distributed frequencies: A renormalization-group analysis. Phys. Rev. Lett. 61(2), 231-234.
Damasio, A. R. 1989. The brain binds entities and events by multiregional activation from convergence zones. Neural Comp. 1, 123-132.
Duncan, J. 1984. Selective attention and the organization of visual information. J. Exp. Psych.: General 113(4), 501-517.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analysis in the cat. Biol. Cybernet. 60, 121-130.
Eckhorn, R., Reitboeck, H. J., Arndt, M., and Dicke, P. 1989. A neural network for feature linking via synchronous activity: Results from cat visual cortex and from simulations. In Models of Brain Function, R. M. J. Cotterill, ed., pp. 255-272. Cambridge Univ. Press, Cambridge.
Gilbert, C. D., and Wiesel, T. N. 1983. Clustered intrinsic connections in cat visual cortex. J. Neurosci. 3, 1116-1133.
Gray, C. M., Konig, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Hummel, J. E., and Biederman, I. 1990. Dynamic binding: A basis for the representation of shape by neural networks. In Proceedings of the 12th Annual Conference of the Cognitive Science Society, pp. 614-621. Lawrence Erlbaum, Hillsdale, NJ.
Kammen, D., Koch, C., and Holmes, P. J. 1989. Collective oscillations in the visual cortex. In Advances in Neural Information Processing Systems 2, D. Z. Anderson, ed. Morgan Kaufmann, San Mateo, CA.
Kuramoto, Y. 1984. Progr. Theor. Phys. (Suppl.) 79, 223-241.
Kuramoto, Y. 1990. Collective synchronization of pulse-coupled oscillators and excitable units. Physica D, submitted.
Lumer, E., and Huberman, B. A. 1991. Hierarchical dynamics in large assemblies of interacting oscillators. Phys. Lett. A 160(3), 1236-1244.
Mountcastle, V. B. 1977. An organizing principle for cerebral function: The unit module and the distributed system. In The Neurosciences: Fourth Study Program. MIT Press, Cambridge, MA.
Pabst, M., Reitboeck, H. J., and Eckhorn, R. 1989. A model of preattentive region definition based on texture analysis. In Models of Brain Function, R. M. J. Cotterill, ed., pp. 137-150. Cambridge Univ. Press, Cambridge.
Shepard, R. N. 1980. Multidimensional scaling, tree-fitting, and clustering. Science 210, 390-398.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1991. Cooperative dynamics in visual processing. Phys. Rev. A 43, 6990-7011.
Sporns, O., Gally, J. A., Reeke, G. N., and Edelman, G. M. 1989. Reentrant signaling among simulated neuronal groups leads to coherency in their oscillatory activity. Proc. Natl. Acad. Sci. U.S.A. 86, 7265-7269.
Sporns, O., Gally, J. A., Reeke, G. N., and Edelman, G. M. 1991. Modeling perceptual grouping and figure-ground segregation by means of active reentrant connections. Proc. Natl. Acad. Sci. U.S.A. 88, 129-133.
Von der Malsburg, Ch. 1981. The correlation theory of brain function. Internal Report 81-2, Dept. of Neurobiology, Max Planck Institute for Biophysical Chemistry, Göttingen.
Von der Malsburg, Ch., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybernet. 54, 29-40.
Winfree, A. T. 1980. The Geometry of Biological Time. Springer-Verlag, Berlin.
Received 4 April 1991; accepted 8 November 1991.
Communicated by Rodney Brooks
A Distributed Neural Network Architecture for Hexapod Robot Locomotion

Randall D. Beer
Departments of Computer Engineering and Science and Biology, Case Western Reserve University, Cleveland, OH 44106 USA

Hillel J. Chiel
Departments of Biology and Neuroscience, Case Western Reserve University, Cleveland, OH 44106 USA

Roger D. Quinn
Kenneth S. Espenschied
Department of Mechanical and Aerospace Engineering, Case Western Reserve University, Cleveland, OH 44106 USA

Patrik Larsson
Department of Electrical Engineering and Applied Physics, Case Western Reserve University, Cleveland, OH 44106 USA
We present a fully distributed neural network architecture for controlling the locomotion of a hexapod robot. The design of this network is directly based on work on the neuroethology of insect locomotion. Previously, we demonstrated in simulation that this controller could generate a continuous range of statically stable insect-like gaits as the activity of a single command neuron was varied, and that it was robust to a variety of lesions. We now report that the controller can be utilized to direct the locomotion of an actual six-legged robot, and that it exhibits a range of gaits and a degree of robustness in the real world that is quite similar to that observed in simulation.

1 Introduction
Even simpler animals are capable of feats of sensorimotor control that exceed those of our most sophisticated robots. Insects, for example, can walk rapidly over rough terrain with a variety of gaits and can immediately adapt to changes in load and leg damage, as well as to developmental changes (Graham 1985). Even on flat horizontal surfaces, insects walk with a variety of different gaits at different speeds (Wilson 1966). These gaits range from the wave gait, in which only one leg steps at a time in a back-to-front sequence on each side of the body (this sequence is called
Neural Computation 4, 356-365 (1992) @ 1992 Massachusetts Institute of Technology
Figure 1: A comparison of simulated and robot gaits. Black bars represent the swing phase of a leg, and the space between bars represents its stance phase. (Top) Leg labeling conventions. (Left) Selected gaits observed in simulation as the activity of the command neuron is varied from lowest (top) to highest (bottom) (Beer 1990). (Right) Gaits generated by the robot under corresponding conditions. Here the duration of a swing bar is 0.5 seconds.

a metachronal wave), to the tripod gait, in which the front and back legs on each side of the body step in unison with the middle leg on the opposite side (see left side of Fig. 1). While most current research in legged robot locomotion utilizes centralized control approaches that are computationally expensive and brittle, insect nervous systems are distributed and robust. What can we learn from biology?

In previous work (Beer et al. 1989), we described a neural network architecture for hexapod locomotion. The design of this network was based on work on the neuroethology of insect locomotion, especially Pearson's flexor burst-generator model for walking in the American cockroach (Periplaneta americana) (Pearson et al. 1973; Pearson 1976). Through simulation, we demonstrated that this network was capable of generating a continuous range of statically stable gaits similar to those observed
in insects (see left side of Fig. 1), as well as smooth transitions between these gaits. The different gaits were produced simply by varying the tonic level of activity of a single command neuron. In addition, a lesion study of this network demonstrated both its surprising robustness and the subtlety of the interaction between its central and peripheral components (Chiel and Beer 1989). A natural question to ask is whether these results were just artifacts of the many physical simplifications of the simulation or whether they are robust properties of the network that persist in the presence of such physical realities as delay, friction, inertia, and noise. This is a difficult question to resolve given the subtle dependencies of this controller on sensory feedback (Chiel and Beer 1989). The only way to determine whether this distributed controller had any practical utility was to design and build a six-legged robot and interface it to the locomotion network.
The circuit responsible for controlling each leg is shown in Figure 2. Each leg controller operates in the following manner: Normally, the foot motor neuron is active (i.e., the leg is down and supporting weight) and excitation from the command neuron causes the backward swing motor neuron to move the leg back, resulting in a stance phase. Periodically, this stance phase is interrupted by a burst from the pacemaker, which inhibits the backward swing and foot motor neurons and excites the forward swing motor neuron, resulting in a swing phase. The time between bursts in the pacemaker, as well as the velocity output of the backward swing motor neuron during a stance phase, depend on the level of excitation provided by the command neuron. In addition, sensory feedback is capable of resetting the pacemaker neuron, with the forward angle sensor encouraging the pacemaker to terminate a burst when the leg is at an extreme forward position and the backward angle sensor encouraging the pacemaker to begin a burst when the leg is at an extreme backward position. There are six copies of the leg controller circuit, one for each leg, except that the single command neuron makes the same two connections on each of them. Following Pearson's model, the pacemakers of all adjacent leg controllers mutually inhibit one another, discouraging adjacent legs from swinging at the same time (Fig. 3). At high speeds of walking, this architecture is sufficient to reliably generate a tripod gait. However, at lower speeds of walking, the network is underconstrained, and there is no guarantee that the resulting gaits will be statically stable. To enforce the generation of metachronal waves, we added the additional constraint that the natural periods of the pacemakers are arranged in a gradient, with longer periods in the back than in the front (Graham 1977). Under these conditions, the pacemakers phase-lock into a stable metachronal
Figure 2: The leg control circuit. Each leg is monitored by two sensory neurons that signal when it has reached an extreme forward or backward position. Each leg is controlled by three motor neurons responsible for the state of the foot, the velocity with which the leg swings forward, and the velocity with which the leg swings backward, respectively. The motor neurons are driven by a pacemaker neuron whose output rhythmically oscillates. A single command neuron makes the same two connections on every leg controller. The architecture also includes direct connections from the forward angle sensor to the motor neurons, duplicating a leg reflex known to exist in the cockroach. The state of each neuron is governed by the equation C_i dV_i/dt = -V_i/R_i + Σ_j w_ji f_j(V_j) + INT_i + EXT_i, where V_i, R_i, and C_i, respectively, represent the voltage, membrane resistance, and membrane capacitance of the ith neuron, w_ji is the strength of the connection from the jth to the ith neuron, f is a saturating linear threshold activation function, and EXT_i is the external current injected into the neuron. INT_i is an intrinsic current present only in the pacemaker neurons that causes them to oscillate. This current switches between a high state of fixed duration and a low state whose duration depends linearly on the tonic level of synaptic input, with excitation decreasing this duration and inhibition increasing it. In addition, a brief inhibitory pulse occurring during a burst or a brief excitatory pulse occurring between bursts can reset the bursting rhythm of the pacemaker.
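As a concrete illustration, the neuron equation in the caption above discretizes directly with a forward-Euler step. The C fragment below (the paper notes that the network was simulated in C) is a minimal sketch under assumed names and sizes; the clipped-linear activation range and the treatment of the pacemaker's intrinsic current INT_i as a precomputed input are illustrative simplifications, not the authors' code.

```c
/* Minimal forward-Euler sketch of the neuron model
 *   C_i dV_i/dt = -V_i/R_i + sum_j w_ji f_j(V_j) + INT_i + EXT_i
 * All names, sizes, and the [0,1] activation range are assumptions. */
#define N 5 /* illustrative circuit size */

/* Saturating linear threshold activation. */
static double f(double v)
{
    if (v < 0.0) return 0.0;
    if (v > 1.0) return 1.0;
    return v;
}

void euler_step(double V[N], const double Cm[N], const double R[N],
                const double w[N][N],   /* w[j][i]: connection j -> i  */
                const double INT[N],    /* pacemaker intrinsic current */
                const double EXT[N],    /* externally injected current */
                double dt)
{
    double dV[N];
    for (int i = 0; i < N; i++) {
        double syn = 0.0;
        for (int j = 0; j < N; j++)
            syn += w[j][i] * f(V[j]);
        dV[i] = (-V[i] / R[i] + syn + INT[i] + EXT[i]) / Cm[i];
    }
    for (int i = 0; i < N; i++)
        V[i] += dt * dV[i];    /* update all neurons synchronously */
}
```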
Figure 3: The pacemaker neurons of adjacent leg controllers are coupled by mutual inhibition.

relationship. We chose to enforce this constraint by making the range of motion of the rear legs slightly larger than that of the middle legs, whose range of motion in turn is slightly larger than that of the front legs. A complete discussion of the design of this network and its relationship to Pearson's model can be found in Beer (1990).

3 Robot
To examine the practical utility of this locomotion controller, we designed and built a six-legged robot (Fig. 4, top). The network was simulated on a personal computer using the C programming language and interfaced with the robot via A/D and D/A boards. Because the controller was originally designed for a simpler simulated body (see top of Fig. 1), two main issues had to be addressed in order to connect this controller to the robot. First, the locomotion controller assumes that the swing and lift motions of the leg are independent, whereas in the robot these two degrees of freedom are coupled (Fig. 4, bottom). In simulation, this problem was dealt with by having a stancing leg passively stretch between its joint and foot. For the robot to maintain a constant height (h) above the ground, the radial length (r) of a stancing leg must be adjusted by the simple kinematic transformation r = h/cos θ.
Figure 4: (Top) The hexapod robot. Its dimensions are 50 cm long by 30 cm wide and it weighs approximately 1 kg. (Bottom) Each leg has two degrees of freedom: an angular motion responsible for swing and stance movements and a radial motion involved in raising and lowering the leg. The swing motion has a range of over 45° from vertical in either direction. The radial motion is accomplished by means of a rack-and-pinion transmission. Both degrees of freedom are driven by 2 W DC motors with associated integral transmissions. Position sensing for each degree of freedom is provided by potentiometers mounted in parallel with the motors.
A second compatibility issue derives from the simplified physics utilized in the original simulation, in which the activity of the forward and backward swing motor neurons was translated directly into velocities (Beer 1990). To interface the output of the neural network controller to the physical dynamics of the robot, we made use of the equilibrium point hypothesis for motor control (for review see Bizzi and Mussa-Ivaldi 1990). This hypothesis states that the nervous system generates limb trajectories not by directly specifying joint torques but rather by specifying a sequence of equilibrium positions known as a virtual trajectory. This hypothesis is based on the following two facts: (1) muscles have springlike properties that are in equilibrium when the torques generated by opposing muscles exactly cancel and (2) neural input to a muscle has the effect of selecting a length/tension curve and therefore an equilibrium position for the limb as a whole. The equilibrium point hypothesis suggests the following approach. The velocity output of the network is integrated to obtain a sequence of virtual positions. These virtual positions are then translated (via the trigonometric transformations described above) into sequences of positions of the swing and lift motors. Finally, these swing and lift positions are fed into position controllers that drive the corresponding motors with voltages proportional to the deviations between the actual positions and the desired positions. These position controllers are implemented in analog circuitry for speed. Because the motors are backdrivable, this scheme also gives a spring-like property to the legs that lends stability to the robot.
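The chain from neural output to motor drive described above can be sketched compactly. The C fragment below is a hypothetical illustration, not the robot's actual code: the net swing-velocity signal is integrated into a virtual angle, the radial command follows r = h/cos θ from earlier in this section, and proportional controllers (realized in analog hardware on the robot) track both degrees of freedom. The struct layout, gain, and names are assumptions.

```c
#include <math.h>

typedef struct {
    double theta_v;  /* integrated virtual swing angle (rad)     */
    double h;        /* desired body height above the ground     */
    double kp;       /* proportional gain of the position servos */
} LegInterface;

/* One control step. v_swing is the net swing motor-neuron output,
 * interpreted as an angular velocity; u_swing and u_radial receive
 * motor drive voltages proportional to position error. */
void leg_control_step(LegInterface *leg, double v_swing, double dt,
                      double theta_actual, double r_actual,
                      double *u_swing, double *u_radial)
{
    /* Integrate velocity into a virtual equilibrium position. */
    leg->theta_v += v_swing * dt;

    /* Kinematic coupling: keep the body height constant in stance. */
    double r_desired = leg->h / cos(leg->theta_v);

    /* Proportional position control; with backdrivable motors this
     * gives the legs their spring-like, stabilizing property. */
    *u_swing  = leg->kp * (leg->theta_v - theta_actual);
    *u_radial = leg->kp * (r_desired - r_actual);
}
```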
4 Results and Discussion
Under the control of the locomotion network, the robot exhibits a range of gaits similar to those observed in simulation as the command neuron activity is changed (see right side of Fig. 1). These gaits range from ones in which distinct metachronal waves are readily distinguished at low speeds of walking to the tripod gait at high speeds of walking. Within this range of gaits, the robot's speed of progression varies from 4.5 to 8.3 cm/sec. In addition, we studied the effects on the robot's walking of a number of lesions that were previously performed in simulation (Chiel and Beer 1989). In all cases, the response of the physical robot was quite similar to what we had previously observed in simulation (Chiel et al., in press). The controller was able to cope with the removal of such components as single sensors, a small fraction of the coupling connections between pacemakers, or the command neuron to pacemaker connections. In addition, we found that the robot was capable of reflex stepping. If the command neuron is disabled but the robot is steadily pushed forward at different speeds by externally updating the position
controllers, then it still exhibits the full range of gaits. Thus it appears that the continuous range of gaits is a robust property of the locomotion network and not simply an accident of simulation. Interestingly, this robotic implementation did reveal one weakness of the locomotion controller that we did not think to examine in simulation. While we have found the controller to be quite robust in general to the delays inherent in the physical robot, it is sensitive to asymmetric delays that cause the legs on one side of the body to consistently lag behind those on the opposite side. These asymmetric delays are due to the inevitable variations in the response characteristics of the electrical and mechanical components of the robot. In the presence of such asymmetric delays, the cross-body phasing of the legs is disturbed. Once we identified this problem, however, a simple adjustment to the stiffnesses of the position controllers, which affects the amount that a leg lags behind its prescribed position, restored the proper phasing. Nevertheless, discoveries such as this justify the effort involved in undertaking a robotic implementation. Brooks has described a partially distributed locomotion controller for a six-legged robot known as Genghis (Brooks 1989). This robot is controlled by a network of finite state machines augmented with registers and timers. In Brooks' locomotion controller, the basic swing/stance cycle of each leg is driven by a chain of reflexes involving only coarse local information about leg position and load. For example, whenever a leg is lifted for some reason, it reflexively swings forward, and whenever one leg swings forward, all other legs move backward slightly. With elaborations of this basic controller, the robot not only successfully walked, but could also negotiate small obstacles and follow slowly moving objects with infrared sensors. However, Brooks' controller is not as fully distributed as the network architecture described in this paper. In Genghis, the movements of the individual leg controllers are coordinated by a single, centralized finite state machine which tells each leg when to lift. Different gaits are generated by modifying this machine. While Maes and Brooks (1990) have recently described a version of this controller that does not require a centralized gait sequencer, their new controller is capable of generating only the tripod gait. In contrast, our neural network controller generates a continuous range of gaits without any centralized gait sequencer. Instead, the different gaits result from the dynamics of interaction between the pacemaker neurons controlling each leg and the sensory feedback that they receive. The architecture described in this paper focuses solely on the problem of sequencing leg movements, so as to maintain static stability at a variety of walking speeds during straight-line locomotion across flat, horizontal surfaces. Of course, gait control is only one aspect of legged locomotion. Other important issues include postural control, turning, negotiation of complex terrain, and compensation for leg damage or loss. In future work, we plan to draw on further studies of insect locomotion and its
neural basis to address these additional issues (Burrows 1989; Pearson and Franklin 1984; Cruse 1990). If we had taken a classical control approach to legged locomotion, it is unlikely that a distributed architecture such as the one we have presented here would have resulted. We believe that our results represent a simple example of a very powerful idea: neural network architectures abstracted from biological systems can be directly applied to the control of autonomous agents (Beer 1990). Because they have evolved over a significant period of time, biological control systems are much more flexible and robust than their engineered counterparts. However, they are also much more difficult to understand. Simulation can serve as an important intermediate step in the process of abstracting a biological control principle. On the other hand, only a physical implementation in an actual robot can prove such a principle's practical utility. In this paper, we have demonstrated that our distributed locomotion controller is a viable approach to hexapod robot walking.
Acknowledgments

This work was supported by Grant N00014-90-J-1545 to R. D. B. from the Office of Naval Research and Grant NGT-50588 from NASA Goddard. Additional support was provided by the Cleveland Advanced Manufacturing Program through the Center for Automation and Intelligent Systems Research and the Howard Hughes Medical Institute. H. J. C. gratefully acknowledges the support of the NSF through Grant BNS8810757.
References

Beer, R. D. 1990. Intelligence as Adaptive Behavior: An Experiment in Computational Neuroethology. Academic Press, San Diego.
Beer, R. D., Chiel, H. J., and Sterling, L. S. 1989. Heterogeneous neural networks for adaptive behavior in dynamic environments. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 577-585. Morgan Kaufmann, San Mateo, CA.
Bizzi, E., and Mussa-Ivaldi, F. A. 1990. Muscle properties and the control of arm movement. In An Invitation to Cognitive Science, Volume 2: Visual Cognition and Action, D. N. Osherson, S. M. Kosslyn, and J. M. Hollerbach, eds., pp. 213-242. MIT Press, Cambridge, MA.
Brooks, R. A. 1989. A robot that walks; emergent behaviors from a carefully evolved network. Neural Comp. 1(2), 253-262.
Burrows, M. 1989. Processing of mechanosensory signals in local reflex pathways of the locust. J. Exp. Biol. 146, 209-227.
Chiel, H. J., and Beer, R. D. 1989. A lesion study of a heterogeneous neural network for hexapod locomotion. Proc. Int. Joint Conf. Neural Networks (IJCNN 89) I, 407-414.
Chiel, H. J., Beer, R. D., Quinn, R. D., and Espenschied, K. S. In press. Robustness of a distributed neural network controller for locomotion in a hexapod robot. To appear in IEEE Transactions on Robotics and Automation.
Cruse, H. 1990. What mechanisms coordinate leg movement in walking arthropods? Trends Neurosci. 13(1), 15-21.
Graham, D. 1977. Simulation of a model for the coordination of leg movements in free walking insects. Biol. Cybernet. 26, 187-198.
Graham, D. 1985. Pattern and control of walking in insects. Adv. Insect Physiol. 18, 31-140.
Maes, P., and Brooks, R. A. 1990. Learning to coordinate behaviors. Proc. Eighth Natl. Conf. AI (AAAI-90), 796-802.
Pearson, K. G. 1976. The control of walking. Sci. Am. 235, 72-86.
Pearson, K. G., and Franklin, R. 1984. Characteristics of leg movements and patterns of coordination in locusts walking on rough terrain. Int. J. Robotics Res. 3(2), 101-112.
Pearson, K. G., Fourtner, C. R., and Wong, R. K. 1973. Nervous control of walking in the cockroach. In Control of Posture and Locomotion, R. B. Stein, K. G. Pearson, R. S. Smith, and J. B. Redford, eds., pp. 495-514. Plenum Press, New York.
Wilson, D. M. 1966. Insect walking. Annu. Rev. Entomol. 11, 103-122.
Received 9 July 1991; accepted 9 October 1991.
Communicated by John Platt and David Willshaw
Multilayer Perceptron Learning Optimized for On-Chip Implementation: A Noise-Robust System

Alan F. Murray
Department of Electrical Engineering, University of Edinburgh, Edinburgh EH9 3JL, Scotland
1 Requirement
This paper describes an approach to multilayer perceptron (MLP) learning that is optimized for hardware implementation. Experimental results to date are promising, and it is the express aim of this paper to present these results concisely - a detailed mathematical analysis will follow in a subsequent, longer publication. Error backpropagation (Rumelhart et al. 1986) has achieved remarkable success as an algorithm for solving hard classification problems with MLP networks. It is, however, not readily amenable to VLSI integration, and the distinction it draws between hidden and output nodes renders it hostile to analog circuit forms. Use of the mathematical chain rule, to calculate the effect of a weight connecting to a hidden unit on the errors {ε_k} in the output layer, renders the error calculation scheme for hidden units different from, and more complicated than, that for output units. The Virtual Targets learning scheme circumvents this problem by introducing an explicit "desired state," or target, for each of the hidden units, which is updated continuously, and stored along with the synapse weights. While this means that a target state must be stored for each input pattern and hidden node, it simplifies and renders homogeneous the process of weight evolution for all neurons. Furthermore, since a target state is already stored for each output neuron, the scheme essentially removes the distinction during learning between hidden and output nodes. Analog integrated circuits based on the virtual targets strategy will therefore be flexible in architectural terms, as all units will be configurable as either output or hidden layer neurons. The fundamental idea of adapting the internal representation, either as well as, or instead of, the weights, is not itself new (Rohwer 1990; Grossman et al. 1990; Krogh et al. 1990). However, these pieces of work were not optimized for hardware implementation. The fundamental difference is that simplicity of implementation has been made the primary goal in the work described in this paper, to produce a system optimized for analog VLSI. There are also several important differences in detail between the work described in this paper and these earlier, similar
approaches. These will be indicated in the next section. Within the boundaries of the tasks it has been applied to, the new scheme does not suffer from such problems, and is furthermore apparently viable as an analog VLSI learning process. Far from being impeded in its efficiency by synaptic inaccuracy in the form of electrical noise, performance is actually enhanced by the presence of this unavoidable uncertainty. This contradicts the commonly held view that, since analog systems are inherently "inaccurate," because of noise, the requirement for high weight precision during MLP (and other) learning procedures cannot be satisfied with such systems. Equating noise-limited accuracy with digital wordlength-limited accuracy is therefore misleading, and the existing dogma regarding the need for very high "precision" during MLP learning should be reevaluated. In Section 2 of this paper, the Virtual Targets learning approach is detailed, and some experimental results are shown in Section 3. Finally, in Sections 4 and 5, a preliminary analog learning chip architecture is proposed, and some conclusions are drawn.

2 "Virtual Targets" Method in an I:J:K MLP Network
The J hidden- and K output-layer neurons obey the usual equations [e.g., o_k = f(x_k), where x_k = Σ_j w_kj o_j]. Weights evolve in response to the presentation of a pattern p via perceptron-like equations similar to those used in backpropagation,

ΔW_kj = η_weights δt ε_kp o'_kp o_jp    (2.1)

ΔW_ji = η_weights δt δ_jp o'_jp o_ip    (2.2)
where, for instance, output layer errors are ε_kp = ō_kp - o_kp, where {ō_kp} are the target states. The terms o'_kp etc. represent the derivatives of the activation function, ∂o_kp/∂x_kp, which effectively discourage learning on weights that connect to neurons that are firmly OFF or ON. The terms o_jp and o_ip discourage learning on weights that connect from neurons that are firmly OFF. η_weights represents the learning speed. Note in passing that equations 2.1 and 2.2 involve information local to the layers that are being connected. The departure from backpropagation is the presence of an error signal for the hidden nodes - this is the crux of the virtual targets scheme. Weights are initialized to random values. The learning scheme then operates as follows:

1. Apply input pattern {o_ip}, and read out the states {o_jp}, {o_kp} of the hidden and output nodes.

2. Assign targets for the hidden nodes, ō_jp = o_jp.
3. Repeat (1) and (2) for all input patterns.

4. Present patterns in random order, allowing weights to evolve according to equations 2.1 and 2.2, and targets {ō_jp} according to

Δō_jp = η_targets δt Σ_{k=1}^{K} ε_kp W_kj    (2.3)
where η_targets is the target learning speed. In simulation, equation 2.3 must be multiplied by an additional term of the form ō_jp(1 - ō_jp), to restrain hidden node targets to the range 0 ≤ ō_jp ≤ 1. When the target values are stored on-chip as charge on capacitors, this "saturation" will occur naturally. Operations (1)-(3) are "initialization" steps, and only (4), with equations 2.1-2.3, describes learning. During learning via equations 2.1-2.3, whenever the output targets {ō_kp} are achieved for a particular input pattern p (i.e., the net has learned that pattern successfully) the hidden node targets {ō_jp} for that pattern are reinitialized as in (2) above. When a pattern p has been learned successfully, the errors {ε_kp} are, by definition, small. Equations 2.1 and 2.2 will no longer cause weight and target modifications, respectively. The target values {ō_jp} may, however, not be equal to the states {o_jp}, and may merely be exerting a force on {W_ji} via equation 2.2 in the appropriate direction. It is therefore necessary to introduce this "reset" mechanism, to cause the learning process to cease to react to pattern p - at least until the weights change to corrupt the successful coding of pattern p. There are several differences in detail between this and the earlier, similar algorithms (Krogh et al. 1990). In Rohwer's work (Rohwer 1990) the hidden node activations are made explicitly the primary independent variables, with the weights related to them by a set of linear equations. In the closest ancestor of the virtual targets method, the CHIR (CHoice of Internal Representation) algorithm of Grossman et al. (1990), the phases of learning that affect weights and target states are distinct, in contrast to the simpler virtual targets algorithm, which adapts weights and targets simultaneously, at different, constant speeds. It is interesting to note that Grossman's group has evolved a derivative of CHIR that avoids the need to store target states, at the expense of a more complex weight update scheme (Nabutovsky et al. 1990). In the most general exposition of the target concept (referred to here as the KTH algorithm), Krogh et al. (1990) have encapsulated the above two examples, and the virtual targets scheme, in a single formalism. In the notation of this paper, they introduce values for the hidden node "net-computed" states {o_jp} of the form
Here, the parameter T controls the extent to which the hidden node states, or "internal representations," feel the effect of the normal forward flow
of data (high T) and the backward flow of error information (low T). As Krogh et al. (1990) point out, the CHIR algorithm alternates between high and low T, depending on which set of weights is being adjusted. In the virtual targets method, the two phases of learning are coincident. The net-computed values of the internal representations {o_jp} represent the high-T limit, while the target states {ō_jp} are retrieved in the low-T limit. During learning these two are reconciled, and the virtual targets algorithm, with the added target reset feature, may be viewed as simply one manifestation of the KTH algorithm. The KTH algorithm can thus be configured to mimic the virtual targets scheme, in which limit it will perform similarly. In all of these target based schemes, the target states {ō_jp} are taking the place of the chain rule in the full backpropagation algorithm, to act as intermediaries in transmitting error information through the intermediate layers. Equations 2.1-2.3 describe the learning scheme's response to a single pattern {o_ip} applied to the inputs, with output states compared to their target values {ō_kp}. This is not how MLP training normally proceeds. To mimic conventional MLP learning, each of equations 2.1-2.3 should be allowed to evolve for a short time with each of the input patterns applied, in random order, and repeatedly. The simulation experiments in the next section were performed in exactly this way. It is then possible to derive equations that describe the time evolution of the error signals {ε_kp} and {δ_jp} over one learning epoch (a complete set of input presentations). These are
(2.5) and (2.6)
where φ_bc = o_ib o_ic Σ_d o_db o_dc ≥ 0. Clearly, the individual terms in equations 2.5 and 2.6 can be variously positive or negative, according to the values of the weights and errors. However, the presence of two competing terms in equation 2.6, one modulated by η_weights and the other by η_targets, is illuminating. This represents the competing forces of perceptron learning on the weights {W_ji}, which aims to reduce {δ_jp}, and movement of the targets {ō_jp} to reduce {ε_kp}, which may increase {δ_jp}. We might expect, therefore, that the {δ_jp} will not decrease monotonically during learning, particularly during the early stages of learning, when the {ε_kp} are substantial. In addition, equation 2.5 does not guarantee gradient descent in the {ε_kp}, since equations 2.1-2.3 were constructed pragmatically, rather than with the express aim of producing gradient descent. We should not therefore be surprised to see occasional "hill climbing" in the output error.
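To make the update scheme concrete, the following C sketch applies equations 2.1 and 2.3 for a single pattern. It is an illustration under stated assumptions rather than the paper's implementation: the weights are stored flat as W[k*J + j], the activation derivative is taken as o(1 - o) (the paper leaves f unspecified), and the analogous hidden-layer update 2.2 is omitted for brevity.

```c
/* One virtual-targets update for pattern p in a J-hidden, K-output
 * network. oj/ok are hidden/output states, tk the output targets,
 * tj the hidden targets (updated in place). Names are illustrative. */
void vt_update(int J, int K,
               const double oj[], const double ok[], const double tk[],
               double tj[],
               double W[],  /* W[k*J + j]: weight from hidden j to output k */
               double eta_w, double eta_t, double dt)
{
    /* Equation 2.1: perceptron-like update of output-layer weights. */
    for (int k = 0; k < K; k++) {
        double eps   = tk[k] - ok[k];          /* output error e_kp  */
        double deriv = ok[k] * (1.0 - ok[k]);  /* o'_kp (assumed f)  */
        for (int j = 0; j < J; j++)
            W[k * J + j] += eta_w * dt * eps * deriv * oj[j];
    }
    /* Equation 2.3: move hidden targets using errors fed back through
     * the weights; the tj*(1-tj) factor restrains the targets to
     * [0,1], as required in simulation. */
    for (int j = 0; j < J; j++) {
        double back = 0.0;
        for (int k = 0; k < K; k++)
            back += (tk[k] - ok[k]) * W[k * J + j];
        tj[j] += eta_t * dt * back * tj[j] * (1.0 - tj[j]);
    }
    /* Hidden-layer weights evolve analogously (equation 2.2), driven
     * by the hidden errors d_jp = tj[j] - oj[j]; omitted here. */
}
```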
3 Experimental Results
The method was applied initially to the standard MLP test problems of Parity and Encoder-Decoder tasks, to determine its functional viability. No fundamental problems were encountered, apart from a tendency to get stuck in local minima, in exactly the same way as in backpropagation learning. To attempt to avoid these minima, noise was injected at the synapse and activity levels, with no serious aspiration that learning would survive such rigors. In other words, each synaptic weight {W_ab} in forward pass mode was augmented by a noise source of variable strength, and each value of x_a = Σ_b W_ab o_b was similarly corrupted. Noise sources of up to 20% on either and both of these quantities were introduced. Somewhat surprisingly, learning on these conceptually simple "hard" problems was actually improved by the presence of high levels of noise, and the network became stuck in local minima less frequently. Including noise in the backward calculations 2.1-2.3 neither improved nor degraded this result. Figure 1 shows an example of a 4-bit parity learning cycle, with a particularly "bad" set of (randomly chosen) initial weights. The noise-free network immediately settles into a local minimum, where it stays indefinitely. With noise, however, excursions from the local minimum are made around 2000 and 5300 learning epochs, and a solution is finally found at around 6600 epochs. The temporary minima in the error signal are not associated with one pathologically noisy learning epoch, and the hill-climbing seen in Figure 1 takes place over several noisy learning epochs. Learning is clearly enhanced by the presence of noise at a high level (around 20%) on both synaptic weights and activities. This result is surprising, in light of the normal assertion that backpropagation requires at least 16-bit precision during learning, and that analog VLSI is unsuitable for backpropagation. The distinction is that digital inaccuracy, determined by the significance of the least significant bit (LSB), implies that the smallest possible weight change during learning is 1 LSB. Analog inaccuracy is, however, fundamentally different, in being noise limited. In principle, infinitesimally small weight changes can be made, and the inaccuracy takes the form of a spread of "actual" values of that weight as noise enters the forward pass. The underlying "accurate" weight does, however, maintain its accuracy as a time average, and the learning process is sufficiently slow to effectively "see through" the relatively low levels of noise in an analog system. The implication is that while analog noise may introduce inappropriate changes in the {W_ab} and {ō_jp}, the underlying trend reflects the accurate synaptic weight values, and makes the appropriate averaged adjustments. The incidental finding - that higher levels of noise actually assist learning - is not so easily explained, although injection of noise into adaptive filter training algorithms is not unusual. These two findings concerning noise would seem to be perfectly general, and have ramifications for all learning processes where weights evolve incrementally, and slowly.
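The noise injection just described amounts to corrupting every weight and every summed activation multiplicatively on the forward pass. The sketch below uses uniform rand()-based noise purely for illustration (Section 4 notes that the exact form of the noise does not appear to be critical); the activation function would be applied to the returned value.

```c
#include <stdlib.h>

/* Multiply a value by (1 + level * u), u uniform in [-1, 1].
 * level = 0.20 corresponds to the 20% noise used in the text. */
static double noisy(double value, double level)
{
    double u = 2.0 * rand() / (double)RAND_MAX - 1.0;
    return value * (1.0 + level * u);
}

/* Noise-corrupted forward pass of one unit: both the synaptic
 * weights W_ab and the summed activation x_a are perturbed. */
double forward_unit(int n_in, const double o[], const double W[],
                    double noise_level)
{
    double x = 0.0;
    for (int b = 0; b < n_in; b++)
        x += noisy(W[b], noise_level) * o[b];  /* synaptic noise */
    return noisy(x, noise_level);              /* activity noise */
}
```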
Figure 1: Learning cycle for 4-bit parity, with the same set of "bad" initial conditions, with and without noise.

In Figure 2 the network settles into a poor local minimum shortly after learning commences, with a mean output error of around 0.45. At around 1300 learning epochs, a better, but still local, minimum is found (output error ≈ 0.27). As the inset shows, the network climbs smoothly out of this local minimum. This effect can be explained as follows. As the network enters the local minimum, a subset of the output patterns is becoming coded correctly (definition of a local minimum). The patterns in this local subset are driving weight changes via equations 2.1-2.3. During this process, the output errors {ε_kp} are reducing, as are the hidden node errors {δ_jp}. Once all of the patterns in the subset are "learned," the {ε_kp} are all zero, and the target reset mechanism sets the {δ_jp} to zero, abruptly. In effect, the hidden node targets, and their associated errors, have been acting as elastic forces on learning, which are suddenly
Figure 2: Learning cycle for parity operation showing hill-climbing in output error (inset). The traces show the maximum single error, the hidden layer error, and the mean output error against learning epochs.

removed. If the local minimum is poor, the input patterns not in the local subset assert themselves via equations 2.1-2.3, and the system climbs out of the poor minimum. Only when the minimum is "good enough" as defined by the error criterion does it persist - as it does at ≈ 7000 epochs in Figure 2. This unusual and surprising feature allows the system to respond appropriately to both "poor" and "good" minima, as defined by the output error criterion. It is, I believe, a consequence of the target method, coupled with the reset mechanism outlined in Section 2. Parity and Encoder-Decoder tasks are not representative of the real classification/generalization problems that an MLP might aspire to solve. As an example of a "real" classification task, with both training and test data sets, the Oxford/Alvey vowel database formed the vehicle for a 54:27:11 MLP to learn to classify vowel sounds from 18 female and 15 male speakers, using the virtual targets strategy. The data appear as the analog outputs of 54 band-pass filters, for 11 different vowel sounds, and 33 speakers. Figure 3 shows an example of a learning cycle involving the first 5 female speakers. The figure charts the evolution of the modulus of the average output error, the hidden node error, and the maximum single-bit output error. This latter error is included to avoid the situation where the average output error is extremely low in the presence of a number of large bit-errors in the output layer, simply because the number
Figure 3: Mean output error, hidden node error, and maximum single-bit error for learning 11 vowel sounds from 5 female speakers, with noise.

of outputs is large. Such errors are not acceptable in a classification problem such as this. The maximum single-bit output error is the largest of all the output bit errors, for all nodes and all patterns. Not until it falls can the network be deemed to have learned the training set completely. Over some 500 simulation (learning) epochs, the errors can be seen to be reduced - not all monotonically. The presence of the noise alluded to above is fairly obvious - and is more so in Figure 4, which shows the results of two learning experiments, with and without noise. The experiments were in every other respect identical, using the same set of randomized initial weights. The noise-free traces are smoother, but learning is protracted over the noise-injected example. A solution is found in the absence of noise, and indeed local minima were found to be rare in this set of experiments, with or without noise. The results in Figure 4 are, however, dramatic, and characteristic of other similar tests. In each case, learning was ended when the maximum single bit error dropped below 0.1, and the noise signal was reduced in magnitude when this error dropped below 0.4. Interestingly, the generalization ability is also improved by the presence of noise on the synapses and activities -
Figure 4: Mean output error, hidden node error, and maximum single-bit error for learning 11 vowel sounds from 5 female speakers, with and without noise.
by up to 5%, and the results given above are for a "noisy" network. The hidden layer errors peak before falling, while the output errors fall more or less monotonically. This is entirely consistent with the observations regarding equation 2.5, and the competing pressures on the hidden node errors. In all, the following experiments were conducted, with the results shown in Table 1. The results presented are averaged values over several learning experiments, with standard deviations in parentheses. These generalization results are broadly similar to those obtained using standard backpropagation, with a momentum term of 0.1, on a 54:27:11 network (Watts 1990), although the backpropagation learning times were considerably longer (Table 2). The backpropagation results are taken from single "best-case" experiments, without the averaging process in Table 1,
Table 1: Simulation Results - Virtual Targets Strategy

Training set   Test set     Learning time,       Generalization, % of test set correct
                            epochs, mean (SD)    Mean (SD)    Best result
5 female       13 female    196 (40)             64 (4.3)     75
10 female      8 female     280 (125)            69 (2.4)     72
5 male         10 male      308 (160)            65 (3.0)     69
10 male        5 male       186 (60)             75 (3.5)     80
Table 2: Simulation Results - Backpropagation Algorithm

Training set   Test set     Learning time (epochs)   Generalization, % of test set correct
5 females      13 females   1452                     66
10 females     8 females    1375                     70
5 males        10 males     1826                     76
10 males       5 males      1386                     78
and as such are prone to the deleterious effects of a small database. Particularly when the training set is small (5 individuals), wide variations in learning "quality" between different experiments are to be expected. This accounts for the difference in generalization results between Table 1 and Table 2. The Virtual Targets strategy has, therefore, similar learning and generalization properties to backpropagation. This is hardly surprising, as it has the same conceptual roots. It is, however, optimized for implementation in analog VLSI, as the following section indicates. In an attempt to clarify the role of noise, Figure 5 shows the effect of different levels of noise on a learning cycle with the same initial conditions. Initially, learning time is reduced by noise injection, as Figure 4 suggests. Increasing the noise level must, however, eventually swamp the data totally, and prevent the classification from being captured at all. This effect is seen in the upper trace in Figure 5, where learning times increase exponentially, at noise levels of around 40%. However, the generalization ability (the measure of the quality of learning, as evidenced by the MLP's ability to classify unseen data correctly) rises essentially monotonically. Figure 5 suggests that a level of around 10-20% noise offers an optimal compromise between extended learning time for high levels of noise, and lower generalization ability for lower levels. The
Figure 5: Learning time and generalization ability as a function of injected noise level.

most useful observation to be made at this stage is that corrupting the training data with noise is held to have the same effect as penalizing high curvature in decision boundaries - in other words it causes the network to draw sweeping curves through the decision space, rather than fitting convoluted curves to the individual training data points (Bishop 1990). In this way, underlying trends are modeled, while fine "irrelevant" detail is ignored. These two findings concerning noise would seem to be perfectly general in the neural context, and have ramifications for all learning processes where weights evolve incrementally, and slowly. Noise sources were inserted initially to model the noise known to be present in analog systems. As the method now stands, the noise sources are being used to improve both learning speed and quality. Analog noise takes the form of both DC and AC inaccuracies in voltages, currents, and device characteristics. DC offsets are essentially canceled out by
Figure 6: Proposed VLSI implementation scheme for information and error flow in the forward (a) and backward (b) directions, respectively.

including the chip in the learning process [i.e., "chip-in-the-loop," as described by INTEL in the context of their ETANN chip (Holler et al. 1989)]. Natural AC noise is in general too weak to provide the high levels of noise that give optimal results here. However, Alspector has reported an elegant solution to this problem in the context of stochastic learning systems (Alspector et al. 1991) involving a small overhead in digital feedback register circuitry. Preliminary experiments suggest that in the work reported in this paper, the exact form of the noise is not critical, and that a simplified version of Alspector's circuitry will suffice.

4 Implementation
Figure 6 shows the flow of information in a virtual targets network. The forward flow of information is that of a standard MLP. In parallel with this, however, error signals are calculated and passed backward via equations 2.1-2.3. This implies that each synapse circuit must perform backward multiplication as implied by equation 2.3, at the same time as the multiplication for Σ_b W_ab o_b. We have already demonstrated (Murray et al. 1990, 1991) that multiplication can be performed using as few as 3 MOSFETs, with the bulk of the synapse circuit devoted to weight storage. The area and complexity overhead of a "two-way," single-multiplicand multiplier is therefore slight. Also indicated in Figure 6 is the storage requirement for the (adaptive) target states on the hidden units. This could be achieved via a set of on-chip memory elements for each of the {ō_jp}. It is more likely to be achieved optimally via a single set of on-chip
memories for {ō_j}, loading a new set for each pattern p along with the inputs, and reading the adapted targets {ō_jp} along with the outputs. The synapse must be capable of incrementing and decrementing its own weight, while passing state signals forward, and error signals backward. Incrementing and decrementing capacitively stored weights is not difficult (Murray 1989) and will involve only two or three MOSFETs. The update equation 2.3 for target states is extremely simple, and thus easily implemented local to the neuron circuit. Looked at critically, this is actually no less complex than backpropagation. In fact, the requirement that the current version of the target state be retrieved after each presentation and stored along with the rest of the exemplar from the training set requires an input/output pad for each of the hidden neurons. However, the distinction between hidden and output neurons has been removed to the extent that the update scheme for weights to hidden and output neurons is identical, and furthermore both sets of neurons now include a local temporary target storage memory. Chips based on the target method will therefore be architecturally flexible, as the exact role of the neurons in the network's function (input, output, or hidden) need not be completely determined at the chip design stage. It is this feature, rather than a significant difference in raw complexity, that renders this scheme more amenable to VLSI. We have not yet designed the chip associated with this architecture. At this early stage, however, it is useful to propose an architecture that demonstrates the advantages of target-based methods. Figure 7 shows how a 5-neuron network would be arranged on silicon. It is intended that both neural activity values {x_i} and post-activation-function outputs {o_i} be accessible as multiplexed network "inputs," to allow maximum flexibility in how input data are presented to the device. If the network were a backpropagation MLP, the input:hidden:output node assignment would have to be defined at the design stage in order to equip the hidden node elements with the circuitry essential to the chain-rule portion of backpropagation. In a target-based chip, however, every neuron has a locally stored target value, and associated target update circuitry. It is the use of these targets that defines which nodes are input, output, or hidden. For instance, to configure this chip as a 2:2:1 network, the input target circuits' function would be ignored as irrelevant. The output target is loaded with each training input as normal, and updated target values ignored. The hidden node targets are loaded from off-chip memory along with the input and output vectors, and updated targets are loaded back into off-chip memory at the end of each training epoch. It would be a simple matter to reconfigure the chip as a 2:1:2 network, by a redefinition of the target usage, without any change in the chip's architecture. Clearly this advantage is much greater in a realistically larger network. I believe that target-based algorithms, of which this is only the most recent variant, have much to offer here. They are not, to be sure, truly adaptive systems, requiring as they do extra storage and control, and
Figure 7: Proposed targets chip architecture.

being unable to function without supervision. However, the conceptual simplicity of the target idea lends an architectural simplicity to chips designed to support it. Furthermore, target algorithms allow uncomplicated update equations, and may even have some direct advantages as described above with respect to hill-climbing in the output error. The reset and learning speed controls on the targets themselves afford an additional means of influencing learning electronically and dynamically. We are now developing 1.5-μm CMOS devices to support target-based learning, and will report the results from these as soon as possible.

5 Conclusions
A scheme has been described that offers learning speeds and generalization ability slightly better than backpropagation, but is both conceptually simpler, and designed for flexible implementation as analog integrated circuitry. It differs from other target-based algorithms in being more pragmatic in its methods. Immunity to analog inaccuracy is high - in fact high levels of artificially introduced noise assist learning. The noise inherent in analog
circuitry is therefore not a problem, and circuitry will be included to inject controlled noise into the learning arithmetic. We are currently testing the scheme's capabilities more exhaustively, in preparation for full implementation using analog pulse-firing VLSI, in conjunction with dynamic weight and target storage, and also using nonvolatile amorphous silicon memory devices (Reeder 1991).
Acknowledgments

The author is grateful to Lionel Tarassenko (University of Oxford) for his encouragement (tempered with healthy scepticism), and constantly useful advice. Financial support from the Science and Engineering Research Council, and the CEC (ESPRIT BRA NERVES project) made this work possible.
References

Alspector, J., Gannett, J. W., Haber, S., Parker, M. B., and Chu, R. 1991. VLSI-efficient technique for generating multiple uncorrelated noise sources and its application to stochastic neural networks. IEEE Trans. Circuits Syst. 38(1), 109-123.
Bishop, C. 1990. Curvature-driven smoothing in feedforward networks. Proc. Int. Neural Networks Conf. 749-752.
Grossman, T., Meir, R., and Domany, E. 1990. Learning by choice of internal representations. In Neural Information Processing Systems (NIPS) Conference 1989, pp. 73-80. Morgan Kaufmann, San Mateo, CA.
Holler, M., Tam, S., Castro, H., and Benson, R. 1989. An electrically trainable artificial neural network (ETANN) with 10240 "floating gate" synapses. Int. Joint Conf. Neural Networks - IJCNN 89, 191-196.
Krogh, A., Thorbergsson, G. I., and Hertz, J. A. 1990. A cost function for internal representations. In Neural Information Processing Systems, pp. 733-740. Morgan Kaufmann, San Mateo, CA.
Murray, A. F. 1989. Pulse arithmetic in VLSI neural networks. IEEE MICRO 9(6), 64-74.
Murray, A. F., Brownlow, M., Hamilton, A., Il Song Han, Reekie, H. M., and Tarassenko, L. 1990. Pulse-firing neural chips for hundreds of neurons. In Neural Information Processing Systems (NIPS) Conference, pp. 785-792. Morgan Kaufmann, San Mateo, CA.
Murray, A. F., Del Corso, D., and Tarassenko, L. 1991. Pulse-stream VLSI neural networks - mixing analog and digital techniques. IEEE Trans. Neural Networks, 193-204.
Nabutovsky, D., Grossman, T., and Domany, E. 1990. Learning by CHIR without storing internal representations. Complex Syst. 4, 519-541.
Reeder, A. A., Thomas, I. P., Smith, C., Wittgreffe, J., Godfrey, D., Hajto, J., Owen, A., Snell, A. J., Murray, A. F., Rose, M., and LeComber, P. G. 1991.
Application of analogue amorphous silicon memory devices to resistive synapses for neural networks. Int. Conf. Neural Networks (Munich), pp. 253-259.
Rohwer, R. 1990. The "moving targets" training algorithm. In Neural Information Processing Systems, pp. 558-565. Morgan Kaufmann, San Mateo, CA.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, James L. McClelland and David E. Rumelhart, eds., pp. 318-362. The MIT Press, Cambridge, MA.
Watts, S. 1990. Computation in neural networks: A comparison of the multilayer perceptron and the hierarchical pattern recognition networks for classification problems. M.Sc. Thesis, University of Oxford.
Received 5 June 1991; accepted 15 October 1991.
Communicated by Todd K. Leen
Computing the Karhunen-Loève Expansion with a Parallel, Unsupervised Filter System

Reiner Lenz, Mats Österberg
Department EE, Linköping University, S-58183 Linköping, Sweden
We use the invariance principle and the principles of maximum information extraction and maximum signal concentration to design a parallel, linear filter system that learns the Karhunen-Loève expansion of a process from examples. In this paper we prove that the learning rule based on these principles forces the system into stable states that are pure eigenfunctions of the input process.
1 Introduction

Unsupervised learning neural networks try to describe the structure of the space of input signals. One type of unsupervised learning neural network is closely related to the traditional statistical method of principal component analysis. One of the earliest attempts to compute one eigenfunction of a class of input signals is the principal component analyzer developed by Oja (see Oja 1989 for a recent overview). Oja's analyzer is based on the idea that the system is essentially a similarity detector or a correlation unit. This leads naturally to a Hebbian learning rule that updates the correlation coefficients so that the output of the analyzer is maximized. Oja showed that this analyzer could learn the first eigenvector of the input process. Recently, Sanger (1989) generalized Oja's result by showing how different one-dimensional analyzers could be used to compute a number of different eigenfunctions. The system proposed by Sanger consists of a number of analyzers of the type introduced by Oja that are connected in a serial fashion: The first analyzer works on the original input signal, then the contribution from the first analyzer is subtracted from the original signal and the modified signal is fed into the next analyzer. Analyzer number i is thus trained with that part of the signal which is unexplained by the previous analyzers 1, ..., i-1 of the system.
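The serial scheme Sanger describes can be sketched in a few lines of C: each analyzer is an Oja-type correlation unit trained on the residual left after subtracting the contributions of the analyzers before it. The Oja-form update, the flat weight layout, and the learning rate are illustrative assumptions, not a transcription of Sanger's algorithm.

```c
#define DIM_MAX 256  /* illustrative bound on the input dimension */

/* One training presentation for a chain of n_units serial analyzers.
 * Row i of W (W[i*dim .. i*dim+dim-1]) holds analyzer i's weights. */
void serial_step(int n_units, int dim, double W[], const double p[],
                 double eta)
{
    double residual[DIM_MAX];
    for (int d = 0; d < dim; d++)
        residual[d] = p[d];

    for (int i = 0; i < n_units; i++) {
        double *w = &W[i * dim];
        double y = 0.0;                     /* analyzer i's output   */
        for (int d = 0; d < dim; d++)
            y += w[d] * residual[d];
        for (int d = 0; d < dim; d++)       /* Oja-type Hebbian step */
            w[d] += eta * y * (residual[d] - y * w[d]);
        for (int d = 0; d < dim; d++)       /* deflate for unit i+1  */
            residual[d] -= y * w[d];
    }
}
```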
In this paper we propose a system that produces output signals that preserve the structure of the input data space, that extract a maximum amount of information, and that are maximally concentrated. These properties will be discussed in detail later on. The main aim of the paper is to show that the system that is constructed this way learns the principal components of the input process in parallel. The basic parts of the system are essentially the same correlation units as those used in Oja's and Sanger's systems. In the following we will call one such correlation unit (together with its update unit) a basic unit. The fundamental difference between the system proposed in this paper and the earlier systems lies in the learning rule that is used to update the basic units. In this paper we will show that our new learning rule makes it possible to train the basic units in parallel. We will also show that the communication between the basic units is minimal. By this we mean that the only information that is sent from unit i to unit j is the response of unit i to the actual input pattern. A given unit has therefore no information about the internal coefficients of the other units in the system. This leads to a clear modular structure of the overall system. The learning rule proposed in this paper is based on a combination of three principles: the structure preserving principle, the maximum information extraction principle, and the maximum concentration principle. The structure preserving or invariance principle is one of the oldest design principles in pattern recognition (see, for example, Lenz 1990). It is based on the idea that the measurement or feature extraction process should preserve the structures in pattern space. The standard example is rotation invariant edge detection in image processing. Here pattern space is structured in the sense that patterns differing only by their orientations are considered to be equal. An edge detector should thus produce features that are independent of the orientation of the edge in the image. The maximum information principle is by now well known in neural network research. This principle states that the space of computed output signals should contain as much information about the space of input signals as possible. This maximum information principle was recognized as a basic principle for neural network design by Linsker (see Linsker 1988) and it was shown to be of fundamental importance in the group-theoretical filter design strategy (see Lenz 1990). In Lenz (1990) it was also shown that the invariance principle and the maximum information principle are closely related and that they lead to identical solutions. The amount of extracted information is, however, much easier to handle algorithmically and we will therefore base our design on the requirement of maximum information extraction. In our analytical studies we showed that the solutions obtained with this approach can be described in terms of eigenfunctions of the input covariance function. These two principles are, however, not strong enough to force the system into a unique stable state: using only the maximum information
principle usually leads to basic units that correspond to arbitrary (orthogonal) mixtures of pure eigenfunctions. Therefore we propose to combine the maximum information principle with the maximum concentration principle. The idea behind this principle is the following: Assume we have two systems C and S that extract an equal amount of information from the input process. Assume further that system C tries to concentrate this information into as few output channels as possible, whereas system S tries to spread this information equally over all channels. A single output channel of system C will (in the mean) have either very strong or very weak output signals whereas a typical output channel in system S will produce medium strength signals. It should be clear that the signals produced by system C are much easier to evaluate: In the simplest case one only has to observe which units produced high outputs and which units were silent. The output signals of such a system are thus well suited for further processing by threshold units. This may be advantageous if the different units in the system communicate via noisy communication channels because in the simplest cases it may be sufficient to transmit on/off signals. We will therefore require that our system should extract a maximum amount of information from the input signals and that it should concentrate this information as much as possible. The dynamics of a similar system that also generalizes Oja's approach was investigated by Leen. This system tries to maximize the sum of the variances of the output signals. It also penalizes the interaction between different units. The behavior of the system depends critically on the value of a free coupling constant that describes the relation between the variance part and the interaction part of the update rule. The dynamics of this system (and a related system) is investigated in detail in Leen (1990). The outline of this paper is as follows: first we will introduce the quality function that we use to measure the quality of the current set of filter coefficients. Then we describe the update rule that optimizes these coefficients by performing a gradient search. We show that this update rule leads to a parallel filter system with a minimal amount of internal communications. The main purpose of this paper is to show that the stable states of the system correspond to the pure eigenfunctions of the input process. In previous papers (see Lenz 1991; Lenz and Österberg 1991a,b,c) we investigated this, and several similar systems based on the same ideas. We applied them to various pattern recognition problems like edge and line detection, texture analysis, and OCR reading. All these experiments showed that the system is indeed able to recognize the structures in pattern space. In the edge- and line-detection application we demonstrated, for example, that the learned filter functions are indeed the eigenfunctions of the input covariance functions. These functions were recognized as the optimal edge and line filters constructed in our work on group theoretical filter design.
2 The Structure of the System
Based on the heuristical considerations described in the introduction we use a quality function of the form

$$Q(W) = \frac{Q_V(W)}{Q_C(W)} \qquad (2.1)$$

where W is the matrix of the weight coefficients of the system. The heuristical considerations are formalized in the following construction: We assume that the pattern vectors have K components and that the system consists of N basic units. The coefficients of the system are collected in the K x N matrix W. We require that all columns of this matrix have unit length, i.e., the units are represented by unit vectors. If p denotes the input (row) vector and o the output vector, then o = pW. The components of o are denoted by o_k: o_k is thus the output of unit number k. The output covariance matrix S is defined as the matrix of all expected values of the products o_i o_j. This value is denoted by <o_i o_j>, where a pair of brackets denotes the expectation operation. In the quality function above, Q_V(W) = det S = det(<o_i o_j>) measures the amount of extracted information by the variation of the output signals, and Q_C(W) measures the concentration within the output vectors; it is defined as

$$Q_C(W) = \sum_{k=1}^{N} \langle o_k^2\rangle\,\bigl(1 - \langle o_k^2\rangle\bigr) \qquad (2.2)$$

If w_k(t) is the coefficient vector of unit k at time t and w_lk(t) is the lth entry in this vector, then at each iteration the following operations are performed by the basic units of the system:

- The correlation units compute the output signals o_k(t) = p(t) w_k(t).
- The update units compute the new state vectors of the units, w_k(t+1) = w_k(t) + Delta w_k(t).
- Later on we will see that it is convenient to have normed coefficient vectors. The update units therefore also normalize the state vectors after each iteration.
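As a concrete illustration, here is a minimal sketch (Python/NumPy; the batch estimation of the expectations, the sizes, and the input scaling are our assumptions, not the authors' implementation) of the quality function evaluated from sample patterns:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 8, 3                                # pattern dimension, number of basic units

# Unit-length filter columns, as required of W.
W = rng.standard_normal((K, N))
W /= np.linalg.norm(W, axis=0)

# A batch of zero-mean patterns stands in for the input process; the scaling
# keeps <o_k^2> below 1, which the concentration term presumes.
scales = np.linspace(0.2, 0.5, K)
P = rng.standard_normal((1000, K)) * scales

O = P @ W                                  # row t holds the output vector o(t) = p(t) W
S = (O.T @ O) / len(O)                     # output covariance S = (<o_i o_j>)

Q_v = np.linalg.det(S)                     # variation (information) term
m2 = np.diag(S)                            # second moments <o_k^2>
Q_c = np.sum(m2 * (1.0 - m2))              # concentration term
print(Q_v, Q_c, Q_v / Q_c)                 # quality Q = Q_V / Q_C
```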
In our implementation we use a gradient-based learning rule, i.e., we select Delta w_lk(t) proportional to the partial derivative dQ/dw_lk:

$$\frac{\partial Q}{\partial w_{lk}} = \frac{Q_C(W)\,\frac{\partial}{\partial w_{lk}} Q_V(W) - Q_V(W)\,\frac{\partial}{\partial w_{lk}} Q_C(W)}{Q_C^2(W)}$$

We will first show that this derivative is a function of <o_i o_j> and <o_k p_l>.
Computing the derivative of Q_C(W) with respect to w_lk gives

$$\frac{\partial}{\partial w_{lk}}\,\langle o_k^2\rangle\bigl(1-\langle o_k^2\rangle\bigr) = 2\langle o_k p_l\rangle - 4\langle o_k^2\rangle\langle o_k p_l\rangle \qquad (2.3)$$

since the derivative of <o_k^2> is 2<o_k p_l>. To compute the derivative of Q_V(W) with respect to w_lk, we note first that only the kth column and the kth row of the matrix S = (<o_i o_j>) depend on the weight w_lk. We have

$$Q_V(W) = \langle o_k^2\rangle \det S_{kk} + \sum_{i\neq k,\,j\neq k} \langle o_i o_k\rangle\langle o_j o_k\rangle\,(-1)^{i+j+\Delta(i,j,k)}\det\,(S_{kk})_{ij} = \langle o_k^2\rangle \det S_{kk} + \sum_{i\neq k,\,j\neq k} \langle o_i o_k\rangle\langle o_j o_k\rangle\,\rho_{ij} \qquad (2.4)$$

where Delta(i, j, k) is defined as

$$\Delta(i,j,k) = \begin{cases} 1 & \text{if } i > k \text{ and } j > k\\ 0 & \text{otherwise} \end{cases} \qquad (2.5)$$

S_kk denotes the (N-1) x (N-1) submatrix of (<o_i o_j>) obtained by deleting the kth row and the kth column; (S_kk)_ij is the (N-2) x (N-2) submatrix of (<o_i o_j>) obtained by deleting row k, row i, column j, and column k. All these submatrices (and thus rho_ij) are independent of w_lk. The derivative is then given by

$$\frac{\partial}{\partial w_{lk}} Q_V(W) = 2\langle o_k p_l\rangle \det S_{kk} + \sum_{i\neq k,\,j\neq k} \bigl(\langle o_j o_k\rangle\langle o_i p_l\rangle + \langle o_i o_k\rangle\langle o_j p_l\rangle\bigr)\,\rho_{ij}$$

since the derivative of <o_i o_k><o_j o_k> is <o_j o_k><o_i p_l> + <o_i o_k><o_j p_l>. From these expressions we see that a unit can compute its increments without knowledge of the internal weights of the other units, i.e., the increments for unit k are functions of the input patterns, the output values, and the previous values of the weight coefficients within this unit.
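The claim that the increments are computable from these expectations can be sanity-checked numerically. The following sketch (our construction, not the authors' code) approximates dQ/dw_lk by central differences and runs the update-and-renormalize iteration described above; in practice the analytic expressions 2.3-2.4 would replace the finite differences.

```python
import numpy as np

def quality(W, P):
    """Q(W) = Q_V / Q_C estimated from a batch of patterns P (one per row)."""
    O = P @ W
    S = (O.T @ O) / len(O)
    m2 = np.diag(S)
    return np.linalg.det(S) / np.sum(m2 * (1.0 - m2))

def numerical_gradient(W, P, h=1e-5):
    """Central-difference approximation of dQ/dw_lk, entry by entry."""
    G = np.zeros_like(W)
    for l in range(W.shape[0]):
        for k in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[l, k] += h
            Wm[l, k] -= h
            G[l, k] = (quality(Wp, P) - quality(Wm, P)) / (2.0 * h)
    return G

rng = np.random.default_rng(1)
K, N = 8, 3
P = rng.standard_normal((1000, K)) * np.linspace(0.2, 0.5, K)
W = rng.standard_normal((K, N))
W /= np.linalg.norm(W, axis=0)

for _ in range(200):                       # gradient ascent on Q ...
    W += 0.1 * numerical_gradient(W, P)
    W /= np.linalg.norm(W, axis=0)         # ... with renormalization by the update units
print(quality(W, P))
```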
Figure 1: The learning filter system. (Inputs enter each basic unit in parallel; unit k produces output O_k(t).)
We see therefore that the learning rule can indeed be implemented in a parallel system of the form shown in Figure 1.
3 Computation of the Optimal Weight Matrix
We will now investigate the quality function Q(W) and compute its maximum points. In the following we assume that the system has already stabilized so that the weight coefficients do not change anymore. This constant weight matrix will be denoted by W. The output vector at time t is then computed as o(t) = p(t)W, and the covariance matrix of the output values is given by

$$S = (\langle o_i o_j\rangle) = \langle o'o\rangle = \langle W'p'pW\rangle = W'\langle p'p\rangle W = W'TW \qquad (3.1)$$

where T is the covariance matrix of the input patterns. We now assume that we know the first- and second-order statistical properties of the pattern process: especially that the mean vector is zero and the covariance matrix T is given. Our goal is to find the matrices W that have normed columns and that are maximum points of the quality function: Q(W) -> max. We assume also that all eigenvalues of the covariance matrix T are different.
In our analysis we will use the singular value decomposition of the weight matrix W. This decomposition (see Golub and Van Loan 1986) can be described as follows:

Theorem 1. Assume W is a K x N matrix with K > N. Then we can find orthogonal matrices U and V of size K and N, respectively, and a K x N matrix D such that

$$W = UDV \qquad (3.2)$$

For the diagonal elements d_kk of D we have d_11 >= d_22 >= ... >= d_NN >= 0, and the other elements of D are all zero.

The decomposition W = UDV is called the singular value decomposition or the SVD of W. In the following we will use the term diagonal matrix for all matrices that have zero entries outside the diagonal. The matrix D in the previous theorem is therefore a diagonal matrix although it is rectangular. The N x N unit matrix will be denoted by E_N. But we will also use E_N for rectangular matrices if they are essentially equal to the unit matrix, that is, if d_ii = 1 for all i = 1...N and d_ij = 0 for all i != j. This should not lead to any confusion since the sizes of the matrices are always clear from the context. In the next series of theorems we investigate the variation part of the quality function as a function of U, D, and V.

Theorem 2.
1. If W = UDV then the variation is independent of V:
$$Q_V(W) = Q_V(UDV) = \det(D'U'TUD) = Q_V(UD) \qquad (3.3)$$
2. If we select W = UD and if the columns of W are unit vectors, then
$$W = UD = UE_N \qquad (3.4)$$

From equation 3.1 and Theorem 1 we get Q_V(W) = det S = det(V'D'U'TUDV) = det(D'U'TUD). In the following we thus investigate Q_V as a function of U and D alone; we select V = E_N, or W = UD. Since U is orthogonal we get W'W = D'U'UD = D'D. If the columns of W are unit vectors, then W'W = D'D = E_N, and since the elements of D are nonnegative we see that D = E_N. For our further investigations we need some facts about Gram determinants:

Definition 1. Assume that a_1, ..., a_L is a set of vectors of equal length and that gamma_ij = a_i'a_j are the scalar products between pairs of such vectors. Then we define (see Groebner 1966, p. 108) the Gram determinant G(a_1, ..., a_L) of these vectors as the determinant of the matrix Gamma = (gamma_ij):

$$G(a_1,\ldots,a_L) = \det(a_i'a_j) = \det(\gamma_{ij}) = \det \Gamma \qquad (3.5)$$
We can now show that the variation function is a Gram determinant:

Theorem 3. There is a matrix X such that
$$Q_V(W) = G(x_1,\ldots,x_N) = \det(x_i'x_j) \qquad (3.6)$$
X has the form X = T^UE_N, where T^ is a diagonal matrix with nonnegative entries.

From Q_V(W) = det(E_N'U'TUE_N) we see that we may assume that T is a diagonal matrix with nonnegative diagonal elements (if it is not, replace U by U = U_1U_2, where both U_i are orthogonal and U_1 diagonalizes T; this is possible since T is a covariance matrix and therefore positive semidefinite). We can thus find a matrix T^ such that T = T^2:

$$Q_V(W) = \det(E_N'U'TUE_N) = \det(E_N'U'\hat{T}^2UE_N) = \det(X'X) \qquad (3.7)$$

Theorem 4. Assume that tau_1 >= ... >= tau_N are the N largest eigenvalues of the matrix T. Then we have
$$Q_V(W) \leq \tau_1 \cdots \tau_N \qquad (3.8)$$
Equality holds if U = E_K.

In the theory of Gram determinants (see Groebner 1966, p. 108) it is shown that
$$G(x_1,\ldots,x_N) \leq G(x_1)\,G(x_2,\ldots,x_N) = \|x_1\|^2\, G(x_2,\ldots,x_N) \qquad (3.9)$$
where equality holds if the vector x_1 is orthogonal to all the vectors x_2, ..., x_N. From the previous theorem we know that Q_V(W) = G(x_1, ..., x_N), where the x_i are the columns of the matrix T^U. Now U is orthogonal and therefore ||x_1||^2 <= tau_1. Repeated application gives the theorem. The results derived so far are collected in the following theorem:
Theorem 5. The maximum value of the variation function is equal to the product of the N largest eigenvalues of the input process. This value is obtained by the filter system consisting of the eigenvectors belonging to these eigenvalues (and by any orthogonal transformation of them). In the following we denote the matrix consisting of the sorted eigenvectors of T by U_0. With this notation we have

$$Q_V(W) \leq Q_V(U_0E_NV) = \tau_1 \cdots \tau_N \qquad (3.10)$$

where W runs over all matrices with unit column vectors.
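Theorem 5 is easy to verify numerically. In the sketch below (our code, for a randomly generated covariance matrix), the variation value attained by the top-N eigenvectors equals the product of the N largest eigenvalues, and random unit-column competitors never exceed it:

```python
import numpy as np

rng = np.random.default_rng(2)
K, N = 8, 3

A = rng.standard_normal((K, K))
T = A @ A.T / (4.0 * K)                  # a random input covariance matrix
tau, U = np.linalg.eigh(T)               # eigenvalues in ascending order
tau, U = tau[::-1], U[:, ::-1]           # resort in descending order

U0 = U[:, :N]                            # eigenvectors of the N largest eigenvalues
Qv_max = np.linalg.det(U0.T @ T @ U0)
print(Qv_max, np.prod(tau[:N]))          # agree up to rounding (eq. 3.10)

for _ in range(1000):                    # random unit-column W never does better
    W = rng.standard_normal((K, N))
    W /= np.linalg.norm(W, axis=0)
    assert np.linalg.det(W.T @ T @ W) <= Qv_max + 1e-12
```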
We now investigate filter systems with an SVD of the form U_0E_NV, and we will see how Q_C can be used to select an optimal matrix V. We note also that U_0'TU_0 = T~ is the diagonal matrix of the eigenvalues, since U_0 consists of the eigenvectors of T.
Theorem 6. If Q_4(W) = sum over k of <o_k^2>^2 and W = U_0E_NV, then

$$Q_C(W) = \mathrm{trace}(E_N'U_0'TU_0E_N) - Q_4(W) = \mathrm{trace}(E_N'\tilde{T}E_N) - Q_4(W) = \sum_{n=1}^{N} \tau_n - Q_4(W) \qquad (3.11)$$

Proof.

$$Q_C(W) = \sum_{n=1}^{N} \langle o_n^2\rangle\bigl(1 - \langle o_n^2\rangle\bigr) = \sum_{n=1}^{N} \langle o_n^2\rangle - \sum_{n=1}^{N} \langle o_n^2\rangle^2 \qquad (3.12)$$

and since the trace is invariant under orthogonal transformations:

$$\sum_{n=1}^{N} \langle o_n^2\rangle = \mathrm{trace}\bigl((\langle o_i o_j\rangle)\bigr) = \mathrm{trace}\,S = \mathrm{trace}(W'TW) = \mathrm{trace}(V'E_N'U_0'TU_0E_NV) = \mathrm{trace}(E_N'\tilde{T}E_N) \qquad (3.13)$$
Theorem 7. If W = U_0E_NV and V = (v_ij), then

$$\langle o_m^2\rangle = \sum_{k} \tau_k v_{km}^2 \qquad (3.14)$$

If v_m is the mth column of V, then <o_m^2> is the mth diagonal element of V'E_N'U_0'TU_0E_NV. This is equal to v_m'T~v_m, and we get <o_m^2> = sum over k of tau_k v_km^2.

Theorem 8. If tau_1 >= tau_2 >= ... >= tau_N, then

$$Q_4(W) = \sum_{m=1}^{N} \langle o_m^2\rangle^2 \leq \sum_{m=1}^{N} \tau_m^2 \qquad (3.15)$$

Since V is orthogonal we have <o_1^2>^2 = sum over k, l of tau_k tau_l v_k1^2 v_l1^2 <= tau_1^2 sum over k, l of v_k1^2 v_l1^2 = tau_1^2. Selecting v_11 = 1 and v_i1 = 0 for i > 1 shows that this maximum value can be obtained if we select the first column of V as the first unit vector. V is orthogonal, and we find therefore also that v_11 = 1 implies v_1i = 0 for i > 1. From this we conclude that V has the form

$$V = \begin{pmatrix} 1 & 0 \\ 0 & V_{N-1} \end{pmatrix}$$

with an (N-1) x (N-1) matrix V_{N-1}. By induction we find that the sum over m of <o_m^2>^2 = sum over k, l, m of tau_k tau_l v_km^2 v_lm^2 is maximal if we select V = E_N.
Finally some comments on the case where several eigenvalues are equal. For simplicity we assume that all eigenvalues are equal: tau_1 = ... = tau_N = tau. In this case we can see that Q_4(U_0E_NV) is also independent of V:

$$Q_4(U_0E_NV) = \sum_{m=1}^{N} \langle o_m^2\rangle^2 = \sum_{k,l,m} \tau_k \tau_l v_{km}^2 v_{lm}^2 = \tau^2 \sum_{k,l,m=1}^{N} v_{km}^2 v_{lm}^2 = N\tau^2 \qquad (3.16)$$

The quality function is in this case independent of V, and all systems of the form W = U_0E_NV with arbitrary orthogonal matrices V are maximum points.
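The role of the fourth-order term can be checked the same way: among systems W = U_0E_NV with maximal variation value, the pure-eigenvector choice V = E_N maximizes Q_4 and hence minimizes Q_C. A minimal sketch under the same assumptions as before (our code):

```python
import numpy as np

rng = np.random.default_rng(3)
K, N = 8, 3
A = rng.standard_normal((K, K))
T = A @ A.T / (4.0 * K)
tau, U = np.linalg.eigh(T)
tau, U0 = tau[::-1], U[:, ::-1][:, :N]   # top-N eigenvectors of T

def q4(W):
    # Q_4(W) = sum_k <o_k^2>^2, the diagonal of W'TW squared and summed.
    return np.sum(np.diag(W.T @ T @ W) ** 2)

best = q4(U0)                            # the pure-eigenvector system (V = E_N)
print(best, np.sum(tau[:N] ** 2))        # equals the sum of squared eigenvalues

for _ in range(1000):                    # no orthogonal mixture V does better
    V, _ = np.linalg.qr(rng.standard_normal((N, N)))
    assert q4(U0 @ V) <= best + 1e-12
```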
4 Summary and Conclusions
In the last section we proved the following two properties of our filter system:
1. The variation value is maximal for all filter systems of the form W = U_0E_NV, where U_0 is the eigenvector matrix of the input covariance matrix and V is an arbitrary orthogonal matrix.
2. For all filter systems with maximum variation value, the concentration value is minimal for the filter system consisting of the pure eigenvectors.
The derivation also shows that the key component in our new quality function is the fourth-order term; this term forces the system into stable states consisting of pure eigenvectors. This was shown under the assumption that the system has already stabilized, but our simulation experiments show that this is the case even if the covariance matrix of the input process has to be estimated during the learning process.
References

Golub, G. H., and Van Loan, C. F. 1986. Matrix Computations. North Oxford Academic.
Gröbner, W. 1966. Matrizenrechnung. B-I Hochschultaschenbücher. Bibliographisches Institut, Mannheim.
Leen, T. K. 1990. Dynamics of learning in linear feature-discovery networks. Network 2(1), 85-105.
Lenz, R. 1990. Group Theoretical Methods in Image Processing. Lecture Notes in Computer Science (Vol. 413). Springer-Verlag, Berlin.
Lenz, R. 1991. On probabilistic invariance. Neural Networks 4(5), 627-641.
Lenz, R., and Osterberg, M. 1990. Learning filter systems. Proc. Int. Neural Networks Conf., Paris.
Lenz, R., and Osterberg, M. 1991a. Filtering: Invariance, information extraction and learning. In Progress in Neural Networks (in press).
Lenz, R., and Osterberg, M. 1991b. Learning filter systems with maximum correlation and maximum separation properties. In SPIE Proc. Applications of Artificial Neural Networks II, Orlando.
Lenz, R., and Osterberg, M. 1991c. A parallel learning filter system that learns the KL-expansion from examples. In Proc. First IEEE-SP Workshop on Neural Networks for Signal Processing, Princeton, pp. 121-130.
Linsker, R. 1988. Self-organization in a perceptual network. IEEE Computer 21(3), 105-117.
Oja, E. 1989. Neural networks, principal components, and subspaces. Int. J. Neural Syst. 1, 61-68.
Sanger, T. D. 1989. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks 2(6), 459-474.
Received 6 May 1991; accepted 7 October 1991
Communicated by James McClelland
Learning and Extracting Finite State Automata with Second-Order Recurrent Neural Networks C. L. Giles C. B. Miller NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 USA
D. Chen, H. H. Chen, G. Z. Sun, Y. C. Lee, University of Maryland, Institute for Advanced Computer Studies, Department of Physics and Astronomy, College Park, MD 20742 USA
We show that a recurrent, second-order neural network using a real-time, forward training algorithm readily learns to infer small regular grammars from positive and negative string training samples. We present simulations that show the effect of initial conditions, training set size and order, and neural network architecture. All simulations were performed with random initial weight strengths and usually converged after approximately a hundred epochs of training. We discuss a quantization algorithm for dynamically extracting finite state automata during and after training. For a well-trained neural net, the extracted automata constitute an equivalence class of state machines that are reducible to the minimal machine of the inferred grammar. We then show through simulations that many of the neural net state machines are dynamically stable, that is, they correctly classify many long unseen strings. In addition, some of these extracted automata actually outperform the trained neural network for classification of unseen strings. 1 Introduction
Grammatical inference, the problem of inferring grammar(s) from sample strings of a language, is a hard problem, even for regular grammars; for a discussion of the levels of difficulty see Gold (1978) and Angluin and Smith (1983). Consequently, there have been many heuristic algorithms developed for grammatical inference, which either scale poorly with the number of states of the inferred automata or require additional information such as restrictions on the type of grammar or the use of queries (Angluin and Smith 1983). For a summary of inference methods, see Fu (1982) and Angluin and Smith (1983) and the recent, comprehensive summary by Miclet (1990).
Neural Computation 4, 393-405 (1992)
@ 1992 Massachusetts Institute of Technology
The history of finite state automata and neural networks is a long one. For example, Minsky (1967) proved that "every finite-state machine is equivalent to, and can be simulated by, some neural network." More recently the training of first-order recurrent neural networks that recognize finite state languages was discussed by Williams and Zipser (1989), Cleeremans et al. (1989), and Elman (1990). The recurrent networks were trained by predicting the next symbol and using a truncation of the backward recurrence. Cleeremans et al. (1989) concluded that the hidden unit activations represented past histories and that clusters of these activations can represent the states of the generating automaton. Mozer and Bachrach (1990) apply a neural network approach with a second-order gating term to a query learning method (Rivest and Schapire 1987). These methods (Rivest and Schapire 1987; Mozer and Bachrach 1990) require active exploration of the unknown environments, and produce very good finite state automata (FSA) models of those environments. We discuss a recurrent neural network solution to grammatical inference and show that second-order recurrent neural networks learn small regular grammars with an infinite number of strings fairly well. This greatly expands on our previous work (Giles et al. 1990; Liu et al. 1990), which considered only regular grammars of unusual state symmetries. Our approach is similar to that of Pollack (1990) and differs in the learning algorithm (the gradient computation is not truncated) and the emphasis on what is to be learned. In contrast to Pollack (1990), we emphasize that a recurrent network can be trained to exhibit fixed-point behavior and correctly classify long, previously unseen strings. Watrous and Kuhn (1992) illustrate similar results using another complete-gradient calculation method. We also show that from different trained neural networks, a large equivalence class of FSA can be extracted. This is an important extension of the work of Cleeremans et al. (1989), where only the states of the FSA were extracted. This work illustrates a method that permits not only the extraction of the states of the FSA, but the full FSA itself. 2 Grammars
2.1 Formal Grammars and Grammatical Inference. We give a brief introduction to formal grammars and grammatical inference; for a thorough introduction, we recommend, respectively, Harrison (1978) and Fu (1982). Briefly, a grammar G is a four-tuple {N, T, P, S}, where N and T are sets of nonterminals and terminals (the alphabet of the grammar), P a set of production rules, and S the start symbol. For every grammar, there exists a language L, a set of strings of the terminal symbols, that the grammar generates or recognizes. There also exist automata that recognize and generate that grammar. In the Chomsky hierarchy of phrase-structured
grammars, the simplest grammar and its associated automata are regular grammars and FSA. This is the class of grammars we will discuss here. It is important to realize that all grammars whose string length and alphabet size are bounded are regular grammars and can be recognized and generated, maybe inefficiently, by finite state automata. Grammatical inference is concerned mainly with the procedures that can be used to infer the syntactic rules (or production rules) of an unknown grammar G based on a finite set of strings I from L(G), the language generated by G, and possibly also on a finite set of strings from the complement of L(G) (Fu 1982). Positive examples of the input strings are denoted as I+, and negative examples as I-. We replace the inference algorithm with a recurrent second-order neural network, and the training set consists of both positive and negative strings. 2.2 Grammars of Interest. To explore the inference capabilities of the recurrent neural net, we have chosen to study a set of seven relatively simple grammars originally created and studied by Tomita (1982) and recently by Pollack (1990), Giles et al. (1991), and Watrous and Kuhn (1992). We hypothesize that formal grammars are excellent learning benchmarks: no feature extraction is required, since the grammar itself constitutes the most primitive representation. For very complex grammars, such as the regular grammar that represents Rubik's cube, this hypothesis might break down and some feature extraction method, such as diversity (Rivest and Schapire 1987), would be necessary. The grammars shown here are simple regular grammars and should be learnable. They all generate infinite languages over {0,1}* and are represented by finite state automata of between three and six states. Briefly, the languages these grammars generate can be described as follows (a sketch of membership tests in code follows the list):
#1 - 1*,
#2 - (10)*,
#3 - an odd number of consecutive 1s is always followed by an even number of consecutive 0s,
#4 - any string not containing "000" as a substring,
#5 - even number of 0s and even number of 1s (see Giles et al. 1990, p. 383; our interpretation of Tomita #5),
#6 - number of 1s - number of 0s is a multiple of 3,
#7 - 0*1*0*1*.
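As referenced above, here is a minimal sketch of membership tests for the seven languages (our direct encoding of the verbal descriptions, not code from the paper); it also reproduces the count of grammatical strings of length 10 or less for grammar #4:

```python
import re

def tomita(n, s):
    """Membership test for Tomita language #n; s is a string over {'0', '1'}."""
    if n == 1:
        return '0' not in s                           # 1*
    if n == 2:
        return re.fullmatch('(10)*', s) is not None   # (10)*
    if n == 3:  # an odd run of 1s must not be followed by an odd run of 0s
        odd_ones = False
        for run in re.findall('0+|1+', s):
            if run[0] == '1':
                odd_ones = len(run) % 2 == 1
            elif odd_ones and len(run) % 2 == 1:
                return False
        return True
    if n == 4:
        return '000' not in s
    if n == 5:
        return s.count('0') % 2 == 0 and s.count('1') % 2 == 0
    if n == 6:
        return (s.count('1') - s.count('0')) % 3 == 0
    if n == 7:
        return re.fullmatch('0*1*0*1*', s) is not None

# The empty string plus all strings of length <= 10 (2047 strings in all).
universe = [format(i, 'b')[1:] for i in range(1, 2 ** 11)]
print(sum(tomita(4, s) for s in universe))            # 1103 grammatical strings for #4
```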
The FSA for Tomita grammar #4 is given in Figure 1c. Note that this FSA contains a so-called "garbage state," that is, a nonfinal state in which all transition paths lead back to the same state. This means that the recurrent neural net must not only learn the grammar but also its complement, and thus correctly classify negative examples. Not all FSA will have garbage states. Such an FSA recognizes a language only when the entire string is seen. In this case there are no situations where "illegal characters" occur; there are no identifiable substrings that could independently cause a string to be rejected.

Figure 1: Finite state automata (FSA) for Tomita's 4th grammar. Initial state nodes are cross-hatched. Final state nodes are drawn with an extra surrounding circle. Transitions induced by a "0" input are shown with solid lines, and transitions induced by a "1" with dashed lines. (a, b) FSA derived from the state-space partitioning of two neural networks that learned the grammar starting from different initial weight conditions. (c) The ideal, minimal FSA for Tomita's 4th grammar. Machines (a) and (b) both reduce via a minimization algorithm to machine (c).
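The minimal machine of Figure 1c is small enough to write down directly. The following sketch (our transcription from the language definition, with state names of our choosing) encodes it as a transition table, garbage state included:

```python
# Minimal DFA for Tomita #4 ("no 000 substring"); states s0-s2 count trailing 0s,
# and 'dead' is the garbage state: nonfinal, with every transition looping back.
DELTA = {
    ('s0', '0'): 's1',   ('s0', '1'): 's0',
    ('s1', '0'): 's2',   ('s1', '1'): 's0',
    ('s2', '0'): 'dead', ('s2', '1'): 's0',
    ('dead', '0'): 'dead', ('dead', '1'): 'dead',
}
FINAL = {'s0', 's1', 's2'}                 # every state except the garbage state

def accepts(s):
    state = 's0'                           # initial state
    for ch in s:
        state = DELTA[(state, ch)]
    return state in FINAL

assert accepts('0100100100') and not accepts('11111000011')
```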
3 Recurrent Neural Network

3.1 Architecture. Recurrent neural networks have been shown to have powerful capabilities for modeling many computational structures;
an excellent discussion of recurrent neural network models and references can be found in Hertz et al. (1991). To learn grammars, we use a second-order recurrent neural network (Lee et al. 1986; Giles et al. 1990; Sun et al. 1990; Pollack 1990). This net has N recurrent hidden neurons labeled S_j; L special, nonrecurrent input neurons labeled I_k; and N^2 x L real-valued weights labeled W_ijk. As long as the number of input neurons is small compared to the number of hidden neurons, the complexity of the network only grows as O(N^2), the same as for a linear network. We refer to the values of the hidden neurons collectively as a state vector S in the finite N-dimensional space [0,1]^N. Note that the weights W_ijk modify a product of the hidden S_j and input I_k neurons. This quadratic form directly represents the state transition diagram of a state process: {input, state} -> {next state}. This recurrent network accepts a time-ordered sequence of inputs and evolves with dynamics defined by the following equation:

$$S_i^{(t+1)} = g\Bigl(\sum_{j,k} W_{ijk}\, S_j^{(t)} I_k^{(t)}\Bigr)$$

where g is a sigmoid discriminant function. Each input string is encoded into the input neurons one character per discrete time step t. The above equation is then evaluated for each hidden neuron S_i to compute the next state vector S of the hidden neurons at the next time step t + 1. With unary encoding the neural network is constructed with one input neuron for each character in the alphabet of the relevant language. This condition might be restrictive for grammars with large alphabets.

3.2 Training Procedure. For any training procedure, one must consider the error criterion, the method by which errors change the learning process, and the presentation of the training samples. The error function E_0 is defined by selecting a special "response" neuron S_0, which is either on (S_0 > 1 - epsilon) if an input string is accepted, or off (S_0 < epsilon) if rejected, where epsilon is the response tolerance of the response neuron. We define two error cases: (1) the network fails to reject a negative string I- (i.e., S_0 > epsilon); (2) the network fails to accept a positive string I+ (i.e., S_0 < 1 - epsilon). For these studies, the acceptance or rejection of an input string is determined only at the end of the presentation of each string. The error function is defined as

$$E_0 = \tfrac{1}{2}\bigl(T_0 - S_0^{(f)}\bigr)^2$$
where T_0 is the desired or target response value for the response neuron S_0. The target response is defined as T_0 = 0.8 for positive examples and T_0 = 0.2 for negative examples. The notation S_0^(f) indicates the final value of S_0, that is, after the final input symbol.
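A minimal sketch of these dynamics and the end-of-string error (Python/NumPy; the sizes, random initialization, and single '$' end symbol are illustrative assumptions rather than the paper's exact settings):

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))               # sigmoid discriminant function

N, L = 4, 3          # hidden neurons; input neurons for '0', '1', and an end symbol
rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(N, N, L))        # second-order weights W_ijk

def run(string, W):
    """Present one symbol per time step; return the final response value S_0."""
    S = np.zeros(N)
    S[0] = 1.0                                    # initial state S_i = delta_i0
    for ch in string + '$':                       # '$' marks the end of the string
        I = np.zeros(L)
        I['01$'.index(ch)] = 1.0                  # unary (one-hot) encoding
        S = g(np.einsum('ijk,j,k->i', W, S, I))   # S_i(t+1) = g(sum_jk W_ijk S_j I_k)
    return S[0]

S0 = run('0110', W)
T0 = 0.8                                          # target for a positive example
E0 = 0.5 * (T0 - S0) ** 2                         # error injected only at string end
print(S0, E0)
```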
The training is an on-line (real-time) algorithm that updates the weights at the end of each sample string presentation (assuming the error E_0 is above threshold) with a gradient-descent weight update rule:

$$\Delta W_{lmn} = -\alpha\,\frac{\partial E_0}{\partial W_{lmn}} \qquad (3.3)$$

where alpha is the learning rate. We also add a momentum term, an additive update to Delta W_lmn, which is eta, the momentum, times the previous Delta W_lmn. To determine Delta W_lmn, the partial derivative of S_0^(f) with respect to W_lmn must be evaluated. From the recursive network state equation, we see that

$$\frac{\partial S_i^{(f)}}{\partial W_{lmn}} = g'(\Xi_i)\Bigl[\delta_{il}\, S_m^{(f-1)} I_n^{(f-1)} + \sum_{j,k} W_{ijk}\, I_k^{(f-1)}\, \frac{\partial S_j^{(f-1)}}{\partial W_{lmn}}\Bigr] \qquad (3.4)$$

where g' is the derivative of the discriminant function and Xi_i is its argument. In general, f and f - 1 can be replaced by any t and t - 1, respectively. These partial derivative terms are calculated iteratively as the equation suggests, with one iteration per input symbol. This on-line learning rule is a second-order form of the recurrent net of Williams and Zipser (1989). The initial terms dS_i^(0)/dW_lmn are set to zero. After the choice of the initial weight values, the dS_i^(t)/dW_lmn can be evaluated in real time as each input I^(t) enters the network. In this way, the error term is forward-propagated and accumulated at each time step t. However, each update of dS_i^(t)/dW_lmn requires O(N^4 x L^2) terms. For N much larger than L, this update is O(N^4), which is the same as for a linear network. This could seriously inhibit the size of the recurrent net if it remains fully interconnected. (A sketch of this forward gradient accumulation appears after Section 3.3 below.)

3.3 Presentation of Training Samples. The training data consist of a series of stimulus-response pairs, where the stimulus is a string over {0,1}*, and the response is either "1" for positive examples or "0" for negative examples. The positive and negative strings, I+ and I-, are generated by a source grammar prior to training. Recall that at each discrete time step, one symbol from the string is presented to the neural network. There was no total error accumulation as occurs in batch learning; training occurred after each string presentation. The sequence of strings during training may be very important. To avoid too much bias (such as short versus long, positive versus negative), we randomly chose the initial training set of 1024 strings, including Tomita's original set, from the set of all strings of length less than 16 (65,535 strings). As the network starts training, it gets to see only some small randomly selected fraction of the training data, about 30 strings. The remaining portion of the data is called "pretest" training data, which the network gets to see only after it either classifies all 30 examples correctly (i.e., the error for every string is within tolerance) or reaches a maximum number of epochs (one epoch = the period during which the network processes each string once). This total maximum number of epochs is 5000 and is set before training. When either of these conditions is met, the network checks the pretest data. The network may add up to 10 misclassified strings from the pretest data. This prevents the training procedure from driving the network too far toward any local minima that the misclassified strings may represent. Another cycle of epoch training begins with the augmented training set. If the net correctly classifies all the training data, the net is said to converge. This is a rather strict sense of convergence. The total number of cycles that the network is permitted to run is also limited, usually to about 20. An extra end symbol is added to the string alphabet to give the network more power in deciding the best final state configuration. For encoding purposes this symbol is simply considered as another character and requires another input neuron. Note that this does not increase the complexity of the FSA! In the training data, the end symbol appears only at the end of each string.
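As promised above, here is a minimal sketch of the forward gradient accumulation: the partials dS_i/dW_lmn are carried alongside the state and updated at every symbol, and equation 3.3 is applied at the end of the string. The O(N^4 x L^2) storage is visible in the shape of the dS array; sizes and the learning rate are illustrative assumptions.

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

N, L = 4, 3
rng = np.random.default_rng(1)
W = rng.uniform(-1.0, 1.0, size=(N, N, L))
alpha = 0.5                                       # learning rate (illustrative)

def train_on_string(string, T0, W):
    """One on-line update: forward pass with accumulated partials, then eq. 3.3."""
    S = np.zeros(N)
    S[0] = 1.0
    dS = np.zeros((N, N, N, L))                   # dS[i, l, m, n] = dS_i / dW_lmn
    for ch in string + '$':
        I = np.zeros(L)
        I['01$'.index(ch)] = 1.0
        net = np.einsum('ijk,j,k->i', W, S, I)
        gp = g(net) * (1.0 - g(net))              # g'(net) for the sigmoid
        direct = np.einsum('il,m,n->ilmn', np.eye(N), S, I)
        carried = np.einsum('ijk,k,jlmn->ilmn', W, I, dS)
        dS = gp[:, None, None, None] * (direct + carried)   # recurrence 3.4
        S = g(net)
    # Gradient descent on E_0 = (T_0 - S_0)^2 / 2, so dE_0/dW = -(T_0 - S_0) dS_0/dW.
    W = W + alpha * (T0 - S[0]) * dS[0]
    return W

W = train_on_string('0110', 0.8, W)               # a positive example, target 0.8
```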
3.4 Extracting State Machines. As the network is training (or after training), we apply a procedure for extracting what the network has learned, that is, the network's current conception of the FSA it is learning (or has learned). The FSA extraction process includes the following steps: (1) clustering of FSA states, (2) constructing a transition diagram by connecting these states together with alphabet-labeled arcs, (3) putting these transitions together to make the full digraph, forming loops, and (4) reducing the digraph to a minimal representation. The hypothesis is that during training, the network begins to partition (or quantize) its state space into fairly well-separated, distinct regions or clusters, which represent corresponding states in some finite state automaton (see Fig. 1). See Cleeremans et al. (1989) for another clustering method. One simple way of finding these clusters is to divide each neuron's range [0,1] into q partitions of equal width. Thus for N hidden neurons, there exist q^N possible partition states. The FSA is constructed by generating a state transition diagram, that is, associating an input symbol with the partition state it just left and the partition state it activates. The initial partition state, or start state of the FSA, is determined from the initial value of S^(t=0). If the next input symbol maps to the same partition state value, we assume that a loop is formed. Otherwise, a new state in the FSA is formed. The FSA thus constructed may contain a maximum of q^N states; in practice it is usually much less, since not all partition states are reached by S^(t). Eventually this process must terminate since there are only a finite number of partitions available; and, in practice, many of the partitions are never reached. The derived FSA can then be reduced to its minimal FSA using standard minimization algorithms (Hopcroft and Ullman 1979). (This minimization process does not change the performance of the FSA; the unminimized FSA has the same time complexity as the minimized FSA. The process just rids the FSA of redundant, unnecessary states and reduces the space complexity.) The initial value of the partition parameter is q = 2 and is increased only if the extracted FSA fails to correctly classify the 1024-string training set. It should be noted that this FSA extraction method may be applied to any discrete-time recurrent net, regardless of order or hidden layers. Of course this simple partitioning or clustering method could prove difficult for large numbers of neurons.

4 Results - Simulations
At the beginning of each run, the network is initialized with a set of random weights, each weight chosen between [-1.0, 1.0]. Unless otherwise noted, each training session has its own unique initial weight conditions. The initial value of the neurons is S_i^(t=0) = delta_i0, though simulations with these values chosen randomly on the interval [0.0, 1.0] showed little significant difference in convergence times. For on-line training the initial hidden neuron values are never reset. The hidden neuron values update as new inputs are seen and when weights change due to string misclassification. The simulations shown in Table 1 focus on Tomita's grammar #4. However, our studies of Tomita's other grammars suggest that the results presented are quite general and apply to any grammar of comparable complexity. Column 1 is a run identification number. The only variables were the number of hidden neurons (3, 4, 5), in column 2, and the unique random initial weights. In column 3 the initial training set size is 32. Column 4 contains the initial training set size plus all later errors made on the training set. The number of epochs for training is shown in column 5. If the network does not converge in 5000 epochs, we say the network has failed to converge on a grammar. The number of errors in the test set (both positive and negative), consisting of the original universe of all strings up to length 15 (65,535 - 1024 strings), is shown in columns 6, 6a, and 6b, and is a measure of the generalization capacity of the trained net. For columns 6 and 6a the error tolerance is respectively < 0.2 and < 0.5. As expected, if the error tolerance is relaxed, the trained network correctly classifies significantly more of the 65K test set. In column 6b are the number of errors for a randomly chosen set of 850,000 strings of length 16 to 99 with an error tolerance < 0.5. The information related to the extraction of FSA from the trained neural network is in columns 7-9. The number of neuron partitions (or quantizations) necessary to obtain the FSA that correctly recognizes the original training set is shown in column 7. The partition parameter q is not unique, and all or many values of q will actually produce the same minimal FSA if the grammar is well learned. The number of states of the unminimized extracted FSA is shown in column 8. The extracted FSA is minimized, and in column 9 the number of states of the minimal extracted FSA is shown. The minimal FSA for the grammar Tomita #4 has 4 states (see Fig. 1c) if the empty string is accepted and 5 states if the empty string is rejected.
[Table 1: Training and extraction results for Tomita grammar #4. For each run: run identification number; number of hidden neurons (3, 4, 5); initial training set size (32); final training set size; number of training epochs; number of test-set errors at tolerances 0.2 and 0.5 and on 850,000 longer strings; partition parameter q; and the number of states of the unminimized and minimized extracted FSA.]
The empty string was not used in the training set; consequently, the neural net did not always learn to accept the empty string. However, it is straightforward to include the empty string in the training set. In Figure 1a and b are the extracted FSA for two different successful training trial runs (#104 and #104b in Table 1) for a four-neuron neural network. The only difference between the two trials is the initial weight values. The minimized FSA (Hopcroft and Ullman 1979) for Figure 1a and b is shown in Figure 1c. All states in the minimized FSA are final states with the exception of state 0, which is a garbage state. For both cases, and in all trials in Table 1 that converged, the minimized extracted FSA is the same as the minimal FSA of Tomita #4. What is interesting is that some extracted FSA, for example, those of trial runs #60 and #104e, will correctly classify all unseen strings, whereas the trained neural networks from which the FSA were extracted will not.
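To make the extraction procedure of Section 3.4 concrete, here is a minimal sketch of steps 1-3; it follows transitions from one representative state vector per partition (a simplification of ours) and assumes a `step(S, ch)` function exposing one evaluation of the trained network's dynamics:

```python
from collections import deque

def extract_fsa(step, s0, alphabet, q=2, max_states=10000):
    """Steps 1-3 of the extraction: partition each neuron's [0,1] range into q
    equal bins, then follow transitions breadth-first; step(S, ch) must return
    the next state vector of the trained network."""
    def partition(S):
        return tuple(min(int(s * q), q - 1) for s in S)    # one bin index per neuron

    start = partition(s0)                                  # start state from S(t=0)
    delta, seen = {}, {start}
    frontier = deque([(start, s0)])
    while frontier and len(seen) < max_states:
        p, S = frontier.popleft()
        for ch in alphabet:
            S2 = step(S, ch)
            p2 = partition(S2)
            delta[(p, ch)] = p2       # same bin forms a loop; a new bin, a new state
            if p2 not in seen:
                seen.add(p2)
                frontier.append((p2, S2))
    return start, delta
```

Feeding `step` with one evaluation of the network dynamics and passing the resulting transition table to a standard DFA minimization (e.g., Hopcroft's algorithm) completes step 4.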
5 Conclusions
Second-order recurrent neural networks are capable of learning small regular grammars rather easily and generalizing very well on unseen grammatical strings. The training results of these neural networks for small simple grammars are fairly independent of the initial values of the weight space and usually converge using an incremental on-line, forward-propagation training algorithm. For a well-trained neural net, the generalization performance on long (string lengths < 100) unseen strings can be perfect. A heuristic method was used to extract FSA from the neural network, both during and after training. (It would be interesting if a neural network could also learn to extract the proper FSA.) Using a standard FSA minimization algorithm, the extracted FSA can be reduced to an equivalent minimal-state FSA. Note that the minimization procedure reduces only the space complexity of the FSA; the time complexity of the minimized and unminimized FSA remains the same. From the extracted FSA, minimal or not, the production rules of the learned grammar are evident. There are some interesting aspects to the extracted FSA. Surprisingly, each of the unminimized FSA shown in the table is unique, even those with the same number of states (i.e., see Runs #105b,d,i,j). For the simple grammar Tomita #4, nearly all networks converged during training (learned the complete training set). For all cases that converged, it is possible to extract state machines that are perfect, i.e., the FSA of the unknown source grammar. For these cases the minimized, extracted FSA with the same number of states constitute a large equivalence class of neural-net-generated FSA; that is, all unminimized FSA are equivalent and have the same performance on string classification. This equivalence class extends across neural networks that vary both in size (number of neurons) and
initial conditions. Thus, the extracted FSA give some indication of how well the neural network learns the grammar. In fact, for some of the well-trained neural nets, for example, run #104, all extracted, minimized FSA for a large range of partition parameters (2-50) are the same as the ideal FSA of the source grammar. We speculate that for these well-trained neural nets, the extracted, minimal FSA will be independent of the choice of the partition parameter. These perfect FSA outperform some of the trained neural networks in correct classification of unseen strings. (By definition, a perfect FSA will correctly classify all unseen strings.) This is not surprising due to the possibility of error accumulation as the neural network classifies long unseen strings (Pollack 1990). However, when the neural network has learned the grammar well, its generalization performance is also perfect (for all strings tested). Thus, the neural network can be considered as a tool for extracting an FSA that is representative of the unknown grammar. Once the FSA is extracted, it can be used independently of the trained neural network. Can we make any arguments regarding neural net capacity and scalability? In our simulations the number of states of the minimal FSA that was extracted was comparable to the number of neurons in the network; but the actual extracted, unminimized FSA had many more states than neurons. However, for Runs #105e and #104h the neural network actually learned an elegant solution, the perfect FSA of the grammar (no minimization was necessary). The question of FSA state capacity and scalability is unresolved. Further work must show how well these approaches can model grammars with large numbers of states and what FSA state capacity of the neural net is theoretically and experimentally reasonable. How a complete-gradient calculation approach using second-order recurrent networks compares to other gradient-truncation, first-order methods (Cleeremans et al. 1989; Elman 1990) is another open question. Surprisingly, a simple clustering approach derives useful and representative FSA from a trained or training neural network. Acknowledgments We would like to acknowledge useful discussions with M. W. Goudreau, S. J. Hanson, G. M. Kuhn, J. McClelland, J. B. Pollack, E. Sontag, D. S. Touretzky, and R. L. Watrous. The University of Maryland authors gratefully acknowledge partial support through grants from AFOSR and DARPA. References Angluin, D., and Smith, C. H. 1983. Inductive inference: Theory and methods. ACM Comput. Surv. 15(3), 237.
Cleeremans, A., Servan-Schreiber, D., and McClelland, J. 1989. Finite state automata and simple recurrent networks. Neural Comp. 1(3), 372.
Elman, J. L. 1990. Finding structure in time. Cog. Sci. 14, 179.
Fu, K. S. 1982. Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ.
Giles, C. L., Sun, G. Z., Chen, H. H., Lee, Y. C., and Chen, D. 1990. Higher order recurrent networks and grammatical inference. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., p. 380. Morgan Kaufmann, San Mateo, CA.
Giles, C. L., Chen, D., Miller, C. B., Chen, H. H., Sun, G. Z., and Lee, Y. C. 1991. Grammatical inference using second-order recurrent neural networks. In Proceedings of the International Joint Conference on Neural Networks, IEEE 91CH3049-4, Vol. 2, p. 357. IEEE.
Gold, E. M. 1978. Complexity of automaton identification from given data. Inform. Control 37, 302.
Harrison, M. H. 1978. Introduction to Formal Language Theory. Addison-Wesley, Reading, MA.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation, p. 163. Addison-Wesley, Redwood City, CA.
Hopcroft, J. E., and Ullman, J. D. 1979. Introduction to Automata Theory, Languages, and Computation, p. 68. Addison-Wesley, Reading, MA.
Lee, Y. C., Doolen, G., Chen, H. H., Sun, G. Z., Maxwell, T., Lee, H. Y., and Giles, C. L. 1986. Machine learning using a higher order correlational network. Physica D 22(1-3), 276.
Liu, Y. D., Sun, G. Z., Chen, H. H., Lee, Y. C., and Giles, C. L. 1990. Grammatical inference and neural network state machines. In Proceedings of the International Joint Conference on Neural Networks, IJCNN-90-WASH-DC, Vol. I, p. 285. Lawrence Erlbaum, Hillsdale, NJ.
Miclet, L. 1990. Grammatical inference. In Syntactic and Structural Pattern Recognition: Theory and Applications, H. Bunke and A. Sanfeliu, eds., Chap. 9. World Scientific, Singapore.
Minsky, M. L. 1967. Computation: Finite and Infinite Machines, Chap. 3.5. Prentice-Hall, Englewood Cliffs, NJ.
Mozer, M. C., and Bachrach, J. 1990. Discovering the structure of a reactive environment by exploration. Neural Comp. 2(4), 447.
Pollack, J. B. 1990. The Induction of Dynamical Recognizers. Tech. Rep. 90-JP-Automata, Dept. of Computer and Information Science, Ohio State University.
Rivest, R. L., and Schapire, R. E. 1987. Diversity-based inference of finite automata. Proc. Twenty-Eighth Annu. Symp. Found. Comput. Sci., p. 78.
Sun, G. Z., Chen, H. H., Giles, C. L., Lee, Y. C., and Chen, D. 1990. Connectionist pushdown automata that learn context-free grammars. In Proceedings of the International Joint Conference on Neural Networks, IJCNN-90-WASH-DC, Vol. I, p. 577. Lawrence Erlbaum, Hillsdale, NJ.
Tomita, M. 1982. Dynamic construction of finite-state automata from examples using hill-climbing. Proc. Fourth Annu. Cogn. Sci. Conf., p. 105.
Watrous, R. L., and Kuhn, G. M. 1992. Induction of finite-state languages using second-order recurrent networks. Neural Comp. 4(3), 406-414. Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1(2), 270. Received 6 June 1991; accepted 29 October 1991.
Communicated by James McClelland
Induction of Finite-State Languages Using Second-Order Recurrent Networks Raymond L. Watrous, Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540 USA; Gary M. Kuhn, Center for Communications Research, IDA, Thanet Road, Princeton, NJ 08540 USA
Second-order recurrent networks that recognize simple finite state languages over {0,1}* are induced from positive and negative examples. Using the complete gradient of the recurrent network and sufficient training examples to constrain the definition of the language to be induced, solutions are obtained that correctly recognize strings of arbitrary length. 1 Introduction We address the problem of inducing languages from examples by considering a set of finite state languages over {0,1}* that were selected for study by Tomita (1982):
L1. 1*
L2. (10)*
L3. no odd-length 0-string anywhere after an odd-length 1-string
L4. not more than 2 0s in a row
L5. bit pairs, #01s + #10s = 0 mod 2
L6. abs(#1s - #0s) = 0 mod 3
L7. 0*1*0*1*

Tomita also selected for each language a set of positive and negative examples (summarized in Table 1) to be used as a training set. By a method of heuristic search over the space of finite state automata with up to eight states, he was able to induce a recognizer for each of these languages (Tomita 1982). Recognizers of finite-state languages have also been induced using first-order recurrent connectionist networks (Elman 1990; Williams and Zipser 1988; Cleeremans et al. 1989). Generally speaking, these results
Neural Computation 4, 406-414 (1992)
@ 1992 Massachusetts Institute of Technology
were obtained by training the network to predict the next symbol (Cleeremans et al. 1989; Williams and Zipser 1988), rather than by training the network to accept or reject strings of different lengths. Several training algorithms used an approximation to the gradient (Elman 1990; Cleeremans et al. 1989) by truncating the computation of the backward recurrence. The problem of inducing languages from examples has also been approached using second-order recurrent networks (Pollack 1990; Giles et al. 1990). Using a truncated approximation to the gradient, and Tomita's training sets, Pollack reported that "none of the ideal languages were induced" (Pollack 1990). On the other hand, a Tomita language has been induced using the complete gradient (Giles et al. 1991). The present paper also reports the induction of several Tomita languages using the complete gradient, with certain differences in method from Giles et al. (1991). 2 Method
2.1 Architecture. The network model consists of one input unit, one threshold unit, N state units, and one output unit. The output unit and each state unit receive a first-order connection from the input unit and the threshold unit. In addition, each of the output and state units receives a second-order connection for each pairing of the input and threshold unit with each of the state units. For N = 3, the model is mathematically identical to that used by Pollack (1990); it has 32 free parameters. 2.2 Data Representation. The symbols of the language are represented by byte values that are mapped into real values between 0 and 1 by dividing by 255. Thus, the ZERO symbol is represented by octal 040 (0.1255). This value was chosen to be different from 0.0, which is used as the initial condition for all units except the threshold unit, which is set to 1.0. The ONE symbol was chosen as octal 370 (0.9725). All strings are terminated by two occurrences of a termination symbol that has the value 0.0.
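A small sketch of this input encoding (our code; the octal byte values are the paper's):

```python
def encode(string):
    """Map a string over {'0', '1'} to the real-valued input sequence."""
    value = {'0': 0o040 / 255.0,    # ZERO: octal 040, approximately 0.1255
             '1': 0o370 / 255.0}    # ONE:  octal 370, approximately 0.9725
    # Two occurrences of the 0.0 termination symbol close every string.
    return [value[ch] for ch in string] + [0.0, 0.0]

print(encode('10'))                 # [0.9725..., 0.1254..., 0.0, 0.0]
```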
2.3 Training. The Tomita languages are characterized in Table 1 by the number of grammatical strings of length 10 or less (out of a total of 2047 strings). The Tomita training sets are also characterized by the number of grammatical strings of length 10 or less included in the training data. For completeness, Table 1 also shows the number of grammatical strings in the training set of length greater than 10. A comparison of the number of grammatical strings with the number included in the training set shows that while Languages 1 and 2 are very sparse, they are almost completely covered by the training data, whereas Languages 3-7 are more dense, and are sparsely covered by the training sets. Possible consequences of these differences are considered in discussing the experimental results.
Raymond L. Watrous and Gary M. Kuhn
408
Table 1: Number of Grammatical and Ungrammatical Strings of Length 10 or Less for Tomita Languages and Number of Those Included in the Tomita Training Sets. Grammatical strings
Ungrammaticalstrings
Length 5 10 Longer strings Length 5 10 Longer strings Language Total Training in training set Total Training in training set 1 2 3 4 5 6 7
11 6 652 1103 683 683 561
9 5 11 10 9 10 11
1 2 1
2
2036 2041 1395 944 1364 1364 1486
8 10 11 7 11 11 6
1 2 1 1 2
A mean-squared error measure was defined with target values of 0.9 and 0.1 for accept and reject, respectively. The target function was weighted so that error was injected only at the end of the string, The complete gradient of this error measure for the recurrent network was computed by a method of accumulating the weight dependencies backward in time (Watrous ef al. 1990). This is in contrast to the truncated gradient used by Pollack (1990) and to the forward-propagation algorithm used by Giles et al. (1991). The networks were optimized by gradient descent using the BFGS algorithm (Luenberger 1984). A termination criterion of was set; it was believed that such a strict tolerance might lead to smaller loss of accuracy on very long strings. No constraints were set on the number of iterations. Five networks with different sets of random initial weights were trained separately on each of the seven languages described by Tomita using exactly his training sets (Tomita 19821, including the null string. The training set used by Pollack (1990) differs only in not including the null string (Pollack 1991). 2.4 Testing. The networks were tested on the complete set of strings up to length 10. Acceptance of a string was defined as the network having a final output value of greater than 0.9 - T and rejection as a final value of less than 0.1 T, where 0 5 T <: 0.4 is the tolerance. The decision was considered ambiguous otherwise.
+
3 Results
The results of the first experiment are summarized in Table 2. For each language, each network is listed by the seed value used to initialize the
Induction of Finite-State Languages
409
random weights. For each network, the iterations to termination are listed, followed by the minimum MSE value reached. Also listed is the percentage of strings of length 10 or less that were correctly recognized by the network, and the percentage of strings for which the decision was uncertain at a tolerance of 0.0. The number of iterations until termination varied widely, from 28 to 37,909. There is no obvious correlation between number of iterations and minimum MSE. 3.1 Language 1. It may be observed that Language 1 is recognized correctly by two of the networks (seeds 72 and 987235) and nearly correctly by a third (seed 239). This latter network failed on the strings l9 and I"', both of which were not in the training set. The network of seed 72 was further tested on all strings of length 15 or less and made no errors. This network was also tested on a string of 100 ones and showed no diminution of output value over the length of the string. When tested on strings of 99 ones plus either an initial zero or a final zero, the network also made no errors. Another network, seed 987235, made no errors on strings of length 15 or less but failed on the string of 100 ones. The hidden units broke into oscillation after about the 30th input symbol and the output fell into a low amplitude oscillation near zero. 3.2 Language 2. Similarly, Language 2 was recognized correctly by two networks (seeds 89340 and 987235) and nearly correctly by a third network (seed 104). The latter network failed only on strings of the form (10)*010,none of which was included in the training data. The networks that performed perfectly on strings up to length 10 were tested further on all strings up to length 15 and made no errors. These networks were also tested on a string of 100 alternations of 1 and 0, and responded correctly. Changing the first or final zero to a one caused both networks correctly to reject the string. 3.3 The Other Languages. For most of the other languages, at least one network converged to a very low MSE value. However, networks that performed perfectly on the training set did not generalize well to a definition of the language. For example, for Language 3, the network at termination, yet the perwith seed 104 reached a MSE of 8 x formance on the test set was only 78.31%. One interpretation of this outcome is that the intended language was not sufficiently constrained by the training set. In the case of Language 5, in no case was the MSE reduced below 0.02. We believe that the model is sufficiently powerful to compute the language. It is possible, however, that the power of the model is marginally
Raymond L. Watrous and Gary M. Kuhn
410
Table 2: Results of Training Three State-Unit Network from 5 Random Starts on Tomita Languages Using Tomita Training Data. ~
~
Language
~~
MSE Accuracy Uncertainty
Seed
Iterations
1
72 104 239 89340 987235
28 95 8707 5345 994
0.0012500000 0.0215882357 0.0005882353 0.0266176471 0.0000000001
100.00 78.07 99.90 66.93 100.00
0.00 20.76 0.00 0.00 0.00
2
72 104 239 89340 987235
5935 4081 807 1084 10706
0.0005468750 0.0003906250 0.0476171875 0.0005468750 0.0001562500
93.36 99.80 62.73 100.00 100.00
4.93 0.20 37.27 0.00 0.00
3
72 104 239 89340 987235
442 37909 9264 8250 5769
0.0149000000 0.0000000008 0.0087000000 0.0005000000 0.0136136712
47.09 78.31 74.60 73.57 50.76
33.27 0.15 11.87 0.00 23.94
4
72 104 239 89340 987235
8630 60 2272 10680 324
0.0004375001 0.0624326924 0.0005000004 0.0003750001 0.0459375000
52.71 20.86 55.40 60.92 22.62
6.45 50.02 9.38 15.53 77.38
5
72 104 239 89340 987235
890 368 1422 2775 2481
0.0526912920 0.0464772727 0.0487500000 0.0271525856 0.0209090867
34.39 45.92 31.46 46.12 66.83
63.80 41.62 36.93 22.52 2.49
6
72 104 239 89340 987235
524 332 1355 8171 306
0.0788760972 0.0789530751 0.0229551248 0.0001733280 0.0577867426
0.05 0.05 31.95 46.21 37.71
99.95 99.95 47.04 5.32 24.87
7
72 104 239 89340 987235
373 8578 969 4259 666
0.0588385157 0.0104224185 0.0211073814 0.0007684520 0.0688690476
9.38 55.74 52.76 54.42 12.55
86.08 17.00 26.58 0.49 74.94
Induction of Finite-State Languages
411
Table 3: Results of Training Three State-Unit Network from 5 Random Starts on Tomita Language 4 Using Probabilistic Training Data ( p = 0.1). Seed Iterations 72 104 239 89340 987235
215 665 205 5244 2589
MSE Accuracy Uncertainty 0.0000001022
100.00
0.0000000001 0.0000000001
99.85 99.90 99.32 92.13
0.0005731708 0.0004624581
0.00 0.05 0.10 0.10
6.55
sufficient, so that finding a solution depends critically on the initial conditions.
4 Further Experiments
The effect of additional training data was investigated by creating training sets in which each string of length 10 or less is randomly included with a fixed probability p . Thus, for p = 0.1 approximately 10% of 2047 strings are included in the training set. A flat random sampling of the lexicographic domain may not be the best approach, however, since grammaticality can vary nonuniformly, as illustrated for Language 4 in Figure 1. The same networks as before were trained on the larger training set for Language 4, with the results listed in Table 3. Under these conditions, a network solution was obtained that generalizes perfectly to the test set (seed 72). This network also made no errors on strings up to length 15. However, very low MSE values were again obtained for networks that do not perform perfectly on the test data (seeds 104 and 239). Network 239 made two ambiguous decisions that would have been correct at a tolerance value of 0.23. Network 104 incorrectly accepted the strings 000 and 1000 and would have correctly accepted the string 0100 at a tolerance of 0.25. Both networks made no additional errors on strings up to length 15. The training data may still be slightly indeterminate. Moreover, the few errors made were on short strings, that are not included in the training data. Since this network model is continuous, and thus potentially infinite state, it is perhaps not surprising that the successful induction of a finitestate language seems to require more training data than was needed for Tomita’s finite-state model (Tomita 1982). The effect of more complex models was investigated for Language 5 using a network with 11 state units; this increases the number of weights
412
Raymond L. Watrous and Gary M. Kuhn
Figure 1: Graphic Representation of all strings in {O,l}", n 5 10, that are grammatical in Language 4. In each concentric ring n, the 2ir radians are divided up into 2" regions, one region for each string in (0, l}", in lexicographic order, starting at 0 = 0. The ring radii are adjusted so that each region has equal area: white regions are grammatical strings, black regions are ungrammatical.
from 32 to 288. Networks of this type were optimized from 5 random initial conditions on the original training data. The results of this experiment are summarized in Table 4. By increasing the complexity of the model, convergence to low MSE values was obtained in every case, although none of these networks generalized to the desired language. Once again, it is possible that more data are required to constrain the language sufficiently.
Induction of Finite-State Languages
413
Table 4 Results of Training Network with 11 State-Units from 5 Random Starts on Tomita Language 5 Using Tomita Training Data. Seed
Iterations
72 104 239 89340 987235
1327 680 357 122 4502
MSE Accuracy Uncertainty 0.0002840909 0.0001136364 0.0006818145 0.0068189264 0.0001704545
53.00 39.47 61.31 63.36 48.41
11.87 16.32 3.32 6.64 16.95
5 Conclusions
We have succeeded in recognizing several simple finite-state languages using second-order recurrent networks. We consider the computation of the complete gradient a key element in this result.
Acknowledgments We thank Lee Giles for sharing with us their results (Giles et af. 1991).
References Cleeremans, A., Servan-Schreiber, D., and McClelland, J. 1989. Finite state automata and simple recurrent networks. Neural Comp. 1(3), 372-381. Elman, J. L. 1990. Finding structure in time. Cog. Sci. 14, 179-212. Giles, C. L., Chen, D., Miller, C. B., Chen, H. H., Sun, G. Z., and Lee, Y. C. 1991. Second-order recurrent neural networks for grammatical inference. Proc. Int. Joint Conf. Neural Networks 11, 273-281. Giles, C. L., Sun, G. Z., Chen, H. H., Lee, Y. C., and Chen, D. 1990. Higher order recurrent networks and grammatical inference. In Advances in Neural Information Systems 2, D. S. Touretzky, ed., pp. 380-387. Morgan Kaufmann, San Mateo, CA. Luenberger, D. G. 1984. Linear and Nonlinear Programming, 2nd ed. AddisonWesley, Reading, MA. Pollack, J. 1991. Personal Communication. Pollack, J. B. 1990. The Induction of Dynamical Recognizers. Tech. Rep. 90-JPAUTOMATA, Ohio State University.
414
Raymond L. Watrous and Gary M. Kuhn
Tomita, M. 1982. Dynamic construction of finite automata from examples using hill-climbing. Proc. Fourth Int. Cog. Sci.Conf., 105-108. Watrous, R. L., Ladendorf, B., and Kuhn, G. M. 1990. Complete gradient optimization of a recurrent network applied to /b/, /d/, /g/ discrimination. 1. Acoust. Sac. Am. 87(3), 1301-1309. Williams, R. J., and Zipser, D. 1988. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Tech. Rep. 1CS Report 8805, UCSD Institute for Cognitive Science.
Received 10 June 1991; accepted 12 December 1991.
This article has been cited by: 1. Chi Sing Leung, Ah Chung Tsoi. 2006. Combined learning and pruning for recurrent radial basis function networks based on recursive least square algorithms. Neural Computing and Applications 15:1, 62-78. [CrossRef] 2. A. Vahed, C. W. Omlin. 2004. A Machine Learning Method for Extracting Symbolic Knowledge from Recurrent Neural NetworksA Machine Learning Method for Extracting Symbolic Knowledge from Recurrent Neural Networks. Neural Computation 16:1, 59-71. [Abstract] [PDF] [PDF Plus] 3. P. Tino, M. Cernansky, L. Benuskova. 2004. Markovian Architectural Bias of Recurrent Neural Networks. IEEE Transactions on Neural Networks 15:1, 6-15. [CrossRef] 4. Peter Tiňo , Barbara Hammer . 2003. Architectural Bias in Recurrent Neural Networks: Fractal AnalysisArchitectural Bias in Recurrent Neural Networks: Fractal Analysis. Neural Computation 15:8, 1931-1957. [Abstract] [PDF] [PDF Plus] 5. Michael Schmitt . 2002. On the Complexity of Computing and Learning with Multiplicative Neural NetworksOn the Complexity of Computing and Learning with Multiplicative Neural Networks. Neural Computation 14:2, 241-301. [Abstract] [PDF] [PDF Plus] 6. Peter Tiňo , Bill G. Horne , C. Lee Giles . 2001. Attractive Periodic Sets in Discrete-Time Recurrent Networks (with Emphasis on Fixed-Point Stability and Bifurcations in Two-Neuron Networks)Attractive Periodic Sets in Discrete-Time Recurrent Networks (with Emphasis on Fixed-Point Stability and Bifurcations in Two-Neuron Networks). Neural Computation 13:6, 1379-1414. [Abstract] [PDF] [PDF Plus] 7. Stefan C. Kremer . 2001. Spatiotemporal Connectionist Networks: A Taxonomy and ReviewSpatiotemporal Connectionist Networks: A Taxonomy and Review. Neural Computation 13:2, 249-306. [Abstract] [PDF] [PDF Plus] 8. Rafael C. Carrasco , Mikel L. Forcada , M. Ángeles Valdés-Muñoz , Ramón P. Ñeco . 2000. Stable Encoding of Finite-State Machines in Discrete-Time Recurrent Neural Nets with Sigmoid UnitsStable Encoding of Finite-State Machines in Discrete-Time Recurrent Neural Nets with Sigmoid Units. Neural Computation 12:9, 2129-2174. [Abstract] [PDF] [PDF Plus] 9. A. Blanco, M. Delgado, M. C. Pegalajar. 2000. Extracting rules from a (fuzzy/crisp) recurrent neural network using a self-organizing map. International Journal of Intelligent Systems 15:7, 595-621. [CrossRef] 10. S. Lawrence, C.L. Giles, S. Fong. 2000. Natural language grammatical inference with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering 12:1, 126-140. [CrossRef]
11. P. Tino, M. Koteles. 1999. Extracting finite-state representations from recurrent neural networks trained on chaotic symbolic sequences. IEEE Transactions on Neural Networks 10:2, 284-302. [CrossRef] 12. Chun-Hsien Chen, V. Honavar. 1999. A neural-network architecture for syntax analysis. IEEE Transactions on Neural Networks 10:1, 94-114. [CrossRef] 13. C.L. Giles, C.W. Omlin, K.K. Thornber. 1999. Equivalence in knowledge representation: automata, recurrent neural networks, and dynamical fuzzy systems. Proceedings of the IEEE 87:9, 1623-1640. [CrossRef] 14. M. Gori, M. Maggini, E. Martinelli, G. Soda. 1998. Inductive inference from noisy examples using the hybrid finite state filter. IEEE Transactions on Neural Networks 9:3, 571-575. [CrossRef] 15. C.W. Omlin, K.K. Thornber, C.L. Giles. 1998. Fuzzy finite-state automata can be deterministically encoded into recurrent neural networks. IEEE Transactions on Fuzzy Systems 6:1, 76-89. [CrossRef] 16. Sepp Hochreiter , Jürgen Schmidhuber . 1997. Long Short-Term MemoryLong Short-Term Memory. Neural Computation 9:8, 1735-1780. [Abstract] [PDF] [PDF Plus] 17. Alan D. Blair, Jordan B. Pollack. 1997. Analysis of Dynamical RecognizersAnalysis of Dynamical Recognizers. Neural Computation 9:5, 1127-1142. [Abstract] [PDF] [PDF Plus] 18. S.C. Kremer. 1996. Comments on "Constructive learning of recurrent neural networks: limitations of recurrent cascade correlation and a simple solution". IEEE Transactions on Neural Networks 7:4, 1047-1051. [CrossRef] 19. Christian W. Omlin, C. Lee Giles. 1996. Stable Encoding of Large Finite-State Automata in Recurrent Neural Networks with Sigmoid DiscriminantsStable Encoding of Large Finite-State Automata in Recurrent Neural Networks with Sigmoid Discriminants. Neural Computation 8:4, 675-696. [Abstract] [PDF] [PDF Plus] 20. Catherine Hanson, Stephen José Hanson. 1996. Development of Schemata during Event Parsing: Neisser's Perceptual Cycle as a Recurrent Connectionist NetworkDevelopment of Schemata during Event Parsing: Neisser's Perceptual Cycle as a Recurrent Connectionist Network. Journal of Cognitive Neuroscience 8:2, 119-134. [Abstract] [PDF] [PDF Plus] 21. Paolo Frasconi, Marco Gori, Marco Maggini, Giovanni Soda. 1996. Representation of finite state automata in Recurrent Radial Basis Function networks. Machine Learning 23:1, 5-32. [CrossRef] 22. Y. Bengio, P. Frasconi. 1996. Input-output HMMs for sequence processing. IEEE Transactions on Neural Networks 7:5, 1231-1249. [CrossRef] 23. C.W. Omlin, C.L. Giles. 1996. Rule revision with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering 8:1, 183-188. [CrossRef]
24. Kam-Chuen Jim, C.L. Giles, B.G. Horne. 1996. An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks 7:6, 1424-1438. [CrossRef] 25. Mikel L. Forcada , Rafael C. Carrasco . 1995. Learning the Initial State of a Second-Order Recurrent Neural Network during Regular-Language InferenceLearning the Initial State of a Second-Order Recurrent Neural Network during Regular-Language Inference. Neural Computation 7:5, 923-930. [Abstract] [PDF] [PDF Plus] 26. Peter Tiňo , Jozef Šajda . 1995. Learning and Extracting Initial Mealy Automata with a Modular Neural Network ModelLearning and Extracting Initial Mealy Automata with a Modular Neural Network Model. Neural Computation 7:4, 822-844. [Abstract] [PDF] [PDF Plus] 27. Peter Manolios , Robert Fanelli . 1994. First-Order Recurrent Neural Networks and Deterministic Finite State AutomataFirst-Order Recurrent Neural Networks and Deterministic Finite State Automata. Neural Computation 6:6, 1155-1173. [Abstract] [PDF] [PDF Plus] 28. C. L. Giles , C. B. Miller , D. Chen , H. H. Chen , G. Z. Sun , Y. C. Lee . 1992. Learning and Extracting Finite State Automata with Second-Order Recurrent Neural NetworksLearning and Extracting Finite State Automata with Second-Order Recurrent Neural Networks. Neural Computation 4:3, 393-405. [Abstract] [PDF] [PDF Plus]
Communicated by David Haussler
Bayesian Interpolation David J.C. MacKay’ Computation and Neural Systems, California Institute of Technology 139-74, Pasadena, CA 91225 USA
Although Bayesian analysis has been in use since Laplace, the Bayesian method of model-comparison has only recently been developed in depth. In this paper, the Bayesian approach to regularization and model-comparison is demonstrated by studying the inference problem of interpolating noisy data. The concepts and methods described are quite general and can be applied to many other data modeling problems. Regularizing constants are set by examining their posterior probability distribution. Alternative regularizers (priors) and alternative basis sets are objectively compared by evaluating the evidence for them. ”Occam’s razor” is automatically embodied by this process. The way in which Bayes infers the values of regularizing constants and noise levels has an elegant interpretation in terms of the effective number of parameters determined by the data set. This framework is due to Gull and Skilling. 1 Data Modeling and Occam’s Razor In science, a central task is to develop and compare models to account for the data that are gathered. In particular this is true in the problems of learning, pattern classification, interpolation and clustering. Two levels of inference are involved in the task of data modeling (Fig. 1). At the first level of inference, we assume that one of the models that we invented is true, and we fit that model to the data. Typically a model includes some free parameters; fitting the model to the data involves inferring what values those parameters should probably take, given the data. The results of this inference are often summarized by the most probable parameter values and error bars on those parameters. This is repeated for each model. The second level of inference is the task of model comparison. Here, we wish to compare the models in the light of the data, and assign some sort of preference or ranking to the alternatives.’ *Present address: Darwin College, Cambridge CB3 9EU, U.K. ’Note that both levels of inference are distinct from decision theory. The goal of inference is, given a defined hypothesis space and a particular data set, to assign probabilities to hypotheses. Decision theory typically chooses between alternative actions on the basis of these probabilities so as to minimize the expectation of a “loss function.” This paper concerns inference alone and no loss functions or utilities are involved.
Neural Computation
4, 415-447 (1992)
@ 1992 Massachusetts Institute of Technology
416
David J.C . MacKay
For example, consider the task of interpolating a noisy data set. The data set could be interpolated using a splines model, using radial basis functions, using polynomials, or using feedforward neural networks. At the first level of inference, we take each model individually and find the best fit interpolant for that model. At the second level of inference we want to rank the alternative models and state for our particular data set that, for example, ”splines are probably the best interpolation model,” or “if the interpolant is modeled as a polynomial, it should probably be a cubic.” Bayesian methods are able consistently and quantitatively to solve both these inference tasks. There is a popular myth that states that Bayesian methods differ from orthodox (also known as “frequentist” or “sampling theory”) statistical methods only by the inclusion of subjective priors that are arbitrary and difficult to assign, and usually do not make much difference to the conclusions. It is true that at the first level of inference, a Bayesian’s results will often differ little from the outcome of an orthodox attack. What is not widely appreciated is how Bayes performs the second level of inference. It is here that Bayesian methods are totally different from orthodox methods. Indeed, when regression and density estimation are discussed in most statistics texts, the task of model comparison is virtually ignored; no general orthodox method exists for solving this problem. Model comparison is a difficult task because it is not possible simply to choose the model that fits the data best: more complex models can always fit the data better, so the maximum likelihood model choice would lead us inevitably to implausible overparameterized models that generalize poorly. “Occam’s razor” is the principle that states that unnecessarily complex models should not be preferred to simpler ones. Bayesian methods automatically and quantitatively embody Occam’s razor (Gull 1988; Jeffreys 19391, without the introduction of ad hoc penalty terms. Complex models are automatically self-penalizing under Bayes’ rule. Figure 2 gives the basic intuition for why this should be expected; the rest of this paper will explore this property in depth. Bayesian methods were first laid out in depth by the Cambridge geophysicist Sir Harold Jeffreys (1939). The logical basis for the Bayesian use of probabilities as measures of plausibility was subsequently established by Cox (1964), who proved that consistent inference in a closed hypothesis space can be mapped onto probabilities. For a general review of Bayesian philosophy the reader is encouraged to read the excellent papers by Jaynes (1986) and Loredo (1989). Since Jeffreys the emphasis of most Bayesian probability theory has been “to formally utilize prior information” (Berger 19851, that is, to perform inference in a way that makes explicit the prior knowledge and ignorance that we have, which orthodox methods omit. However, Jeffreys‘ work also laid the foundation for Bayesian model comparison, which does not involve an emphasis on prior information, but rather emphasizes getting maximal information
Bayesian Interpolation
417
to create new
Figure 1: Where Bayesian inference fits into the data modeling process. This figure illustrates an abstraction of the part of the scientific process, in which data are collected and modeled. In particular, this figure applies to pattern classification, learning, interpolation, etc. The two double-framed boxes denote the two steps that involve inference. It is only in those two steps that Bayes’ rule can be used. Bayes does not tell you how to invent models, for example. The first box, ”fitting each model to the data,” is the task of inferring what the model parameters might be given the model and the data. Bayes may be used to find the most probable parameter values, and error bars on those parameters. The result of applying Bayes to this problem is often little different from the result of using orthodox statistics. The second inference task, model comparison in the light of the data, is where Bayes is in a class of its own. This second inference problem requires a quantitative Occam’s razor to penalize over-complex models. Bayes can assign objective preferences to the alternative models in a way that automatically embodies Occam’s razor.
from the data. Jeffreys applied this theory to simple model comparison problems in geophysics, for example, testing whether a single additional parameter is justified by the data. Since the 1960s, Jeffreys’ model comparison methods have been applied and extended in the economics literature (Zellner 1984), and by a small number of statisticians (Box and Tiao 1973). Only recently has this aspect of Bayesian analysis been further developed and applied to more complex problems in other fields.
418
David J. C . MacKay
Figure 2: Why Bayes embodies Occam’s razor. This figure gives the basic intuition for why complex models are penalized. The horizontal axis represents the space of possible data sets D. Bayes rule rewards models in proportion to how much they predicted the data that occurred. These predictions are quantified by a normalized probability distribution on D. In this paper, this probability of the data given model 3-1;, P ( D I ‘Hi),is called the evidence for Hi. A simple model XI makes only a limited range of predictions, shown by P ( D 1 3-11); a more powerful model ‘H2, that has, for example, more free parameters than 3-11, is able to predict a greater variety of data sets. This means however that 3-12 does not predict the data sets in region C1 as strongly as 3-11. Assume that equal prior probabilities have been assigned to the two models. Then if the data set falls in region C1, the less powerful model 3-11 will be the more probable model. This paper will review Bayesian model comparison, “regularization,” and noise estimation, by studying the problem of interpolating noisy data. The Bayesian framework I will describe for these tasks is due to Gull (1988, 1989a), Gull and Skilling (1991), and Skilling (1991), who have used Bayesian methods to achieve the state of the art in image reconstruction. The same approach to regularization has also been developed in part by Szeliski (1989). Bayesian model comparison is also discussed by Bretthorst (19901, who has used Bayesian methods to push back the limits of NMR signal detection. The same Bayesian theory underlies the unsupervised classification system, Autoclass (Hanson et al. 1991). The fact that Bayesian model comparison embodies Occam’s razor has been rediscovered by Kashyap (1977) in the context of modeling time series; his paper includes a thorough discussion of how Bayesian model comparison is different from orthodox “Hypothesis testing.” One of the earliest applications of these sophisticated Bayesian methods of model comparison to real data is by Patrick and Wallace (1982); in this fascinating paper, competing models accounting for megalithic stone circle geometry are compared within the description length framework, which is equivalent to Bayes.
Bayesian Interpolation
419
As the quantities of data collected throughout science and engineering continue to increase, and the computational power and techniques available to model that data also multiply, I believe Bayesian methods will prove an ever more important tool for refining our modeling abilities. I hope that this review will help to introduce these techniques to the "neural" modeling community. A companion paper (MacKay 1992a) will demonstrate how these techniques can be fruitfully applied to backpropagation neural networks. Another paper will show how this framework relates to the task of selecting where next to gather data so as to gain maxima1 information about our models (MacKay 1992b). 2 The Evidence and the Occam Factor
Let us write down Bayes' rule for the two levels of inference described above, so as to see explicitly how Bayesian model comparison works. Each model 'Hi ('H stands for "hypothesis") is assumed to have a vector of parameters w. A model is defined by its functional form and two probability distributions: a "prior" distribution P(w I 'Hi) that states what values the model's parameters might plausibly take; and the predictions P ( D I w, 'Hi) that the model makes about the data D when its parameters have a particular value w. Note that models with the same parameterization but different priors over the parameters are therefore defined to be different models. 1. Model fitting. At the first level of inference, we assume that one model 'Hiis true, and we infer what the model's parameters w might be given the data D. Using Bayes' rule, the posterior Probability of the parameters w is
(2.1) In words: Posterior =
Likelihood x Prior Evidence
The normalizing constant P ( D 1 7-l;) is commonly ignored, since it is irrelevant to the first level of inference, that is, the choice of w; but it will be important in the second level of inference, and we name it the midence for ' H i . It is common to use gradient-based methods to find the maximum of the posterior, which defines the most probable value for the parameters, W M ~it; is then common to summarize the posterior distribution by the value of WMP, and error bars on these best fit parameters. The error bars are obtained from the curvature
420
David J. C. MacKay of the posterior; writing the Hessian A = -VV logP(w I D, ' H i ) and Taylor-expanding the log posterior with Aw = w - WMP,
we see that the posterior can be locally approximated as a gaussian with covariance matrix (error bars) A-'.2 2. Model comparison. At the second level of inference, we wish to infer which model is most plausible given the data. The posterior probability of each model is
Notice that the data-dependent term P ( D I 'Hi) is the evidence for ' H i , which appeared as the normalizing constant in equation 2.1. The second term, P('Hi), is a "subjective" prior over our hypothesis space that expresses how plausible we thought the alternative models were before the data arrived. We will see later that this subjective part of the inference will typically be overwhelmed by the objective term, the evidence. Assuming that we have no reason to assign strongly differing priors P('H,) to the alternative models, models 'Hi are ranked by evaluating the evidence. Equation 2.3 has not been normalized because in the data modeling process we may develop new models after the data have arrived (Fig. l), when an inadequacy of the first models is detected, for example. So we do not start with a completely defined hypothesis space. Inference is open ended: we continually seek more probable models to account for the data we gather. New models are compared with previous models by evaluating the evidence for them. The key concept of this paper is this: to assign a preference to alternative models Xi,a Bayesian evaluates the evidence P(D I ' H i ) . This concept is very general: the evidence can be evaluated for parametric and "nonparametric" models alike; whether our data modeling task is a regression problem, a classification problem, or a density estimation problem, the evidence is the Bayesian's transportable quantity for comparing alternative models. In all these cases the evidence naturally embodies Occam's razor; we will examine how this works shortly. 'Whether this approximation is a good one or not will depend on the problem we are solving. For the interpolation models discussed in this paper, there is only a single maximum in the posterior distribution, and the gaussian approximation is exact. For more general statistical models we still expect the posterior to be dominated by locally gaussian peaks on account of the central limit theorem (Walker 1967). Multiple maxima that arise in more complex models complicate the analysis, but Bayesian methods can still successfully be applied (Hanson et al. 1991; MacKay 1992a; Neal 1991).
Bayesian Interpolation
421
Of course, the evidence is not the whole story if we have good reason to assign unequal priors to the alternative models 31. (To only use the evidence for model comparison is equivalent to using maximum likelihood for parameter estimation.) The classic example is the ”Sure Thing” hypothesis, @ E.T. Jaynes, which is the hypothesis that the data set will be D, the precise data set that actually occurred; the evidence for the Sure Thing hypothesis is huge. But Sure Thing belongs to an immense class of similar hypotheses that should all be assigned correspondingly tiny prior probabilities; so the posterior probability for Sure Thing is negligible alongside any sensible model. Models like Sure Thing are rarely seriously proposed in real life, but if such models are developed then clearly we need to think about precisely what priors are appropriate. Patrick and Wallace (1982), studying the geometry of ancient stone circles (for which some people have proposed extremely elaborate theories!), discuss a practica1 method of assigning relative prior probabilities to alternative models by evaluating the lengths of the computer programs that decode data previously encoded under each model. This procedure introduces a second sort of Occam’s razor into the inference, namely a prior bias against complex models. However, this paper will not include such prior biases; we will address only the data’s preference for the alternative models, that is, the evidence, and the Occam’s razor that it embodies. In the limit of large quantities of data this objective Occam’s razor will always be the more important of the two.
2.1 A Modern Bayesian Approach to Priors. It should be pointed out that the emphasis of this modern Bayesian approach is not on the inclusion of priors into inference. There is not one significant “subjective prior” in this entire paper. (For problems where significant subjective priors do arise see Gull 1989b; Skilling 1989.) The emphasis is that consistent degrees of preference for alternative hypotheses are represented by probabilities, and relative preferences for models are assigned by evaluating those probabilities. Historically, Bayesian analysis has been accompanied by methods to work out the “right” prior P(w I IFI) for a problem, for example, the principles of insufficient reason and maximum entropy. The modern Bayesian, however, does not take a fundamentalist attitude to assigning the “right” priors - many different priors can be tried; each particular prior corresponds to a different hypothesis about the way the world is. We can compare these alternative hypotheses in the light of the data by evaluating the evidence. This is the way in which alternative regularizers are compared, for example. If we try one model and obtain awful predictions, we have learned something. “A failure of Bayesian prediction is an opportunity to learn” (Jaynes 1986), and we are able to come back to the same data set with new models, using new priors for example.
David J. C. MacKay
422
c
W
Aow
Figure 3: The Occam factor. This figure shows the quantities that determine the Occam factor for a hypothesis 7-Ii having a single parameter W. The prior distribution (dotted line) for the parameter has width Aow. The posterior distribution (solid line) has a single peak at WMP with characteristic width Aw. The Occam factor is Aw/A*w. 2.2 Evaluating the Evidence. Let us now explicitly study the evidence to gain insight into how the Bayesian Occam's razor works. The evidence is the normalizing constant for equation 2.1:
P ( D I h!;)=
1
P ( D 1 w1'Hi)P(W I 'Hi) dw
(2.4)
For many problems, including interpolation, it is common for the posterior P(w I D, h!,) c( P ( D 1 w, 'Hi)P(wI 'Hi) to have a strong peak at the most probable parameters WMP (Fig. 3). Then the evidence can be approximated by the height of the peak of the integrand P ( D 1 w,'H;)P(wI h!i) times its width, Aw:
--
(2.5) P ( D I 'Hi) N P ( D I W M P , ~P(WMVI~ ! ~ ) I 'Hi) AW Evidence N Best fit likelihood Occam factor Thus the evidence is found by taking the best fit likelihood that the model can achieve and multiplying it by an "Occam factor" (Gull 1988), which is a term with magnitude less than one that penalizes 'Hi for having the parameter w. 2.3 Interpretation of the Occam Factor. The quantity A w is the posterior uncertainty in w. Imagine for simplicity that the prior P(w 1 h!]) is uniform on some large interval AOw, representing the range of values of w that 7-1, thought possible before the data arrived (Fig. 3). Then P(WMPI 'HI) = l/Aow, and Aw Occam factor = Aow ~
Bayesian Interpolation
423
that is, the ratio of the posterior accessible volume of ‘His parameter space to the prior accessible volume, or the factor by which ‘Hi’s hypothesis space collapses when the data arrive (Gull 1988; Jeffreys 1939). The model Xi can be viewed as being composed of a certain number of equivalent submodels, of which only one survives when the data arrive. The Occam factor is the inverse of that number. The log of the Occam factor can be interpreted as the amount of information we gain about the model when the data arrive. Typically, a complex model with many parameters, each of which is free to vary over a large range Aow, will be penalized with a larger Occam factor than a simpler model. The Occam factor also provides a penalty for models that have to be finely tuned to fit the data; the Occam factor promotes models for which the required precision of the parameters Aw is coarse. The Occam factor is thus a measure of complexity of the model, but unlike the V-C dimension or algorithmic complexity, it relates to the complexity of the predictions that the model makes in data space; therefore it depends on the number of data points and other properties of the data set. Which model achieves the greatest evidence is determined by a trade-off between minimizing this natural complexity measure and minimizing the data misfit. 2.4 Occam Factor for Several Parameters. If w is k-dimensional, and if the posterior is well approximated by a gaussian, the Occam factor is given by the determinant of the gaussian’s covariance matrix:
P ( D I ‘Hi) Evidence
21
1~
P ( D I w ~ p , H i )P ( w ~ pI ‘Hi)(2~)~/’det-’/’A , -‘ +
Best fit likelihood
(2.6)
Occam factor
where A = -VV logP(w I D , ‘Hi), the Hessian that we already evaluated when we calculated the error bars on WMP. As the amount of data collected, N,increases, this gaussian approximation is expected to become increasingly accurate on account of the central limit theorem (Walker 1967). For the linear interpolation models discussed in this paper, this gaussian expression is exact for any N. 2.5 Comments. 0
Bayesian model selection is a simple extension of maximum likelihood model selection: fheevidence is obtained by muzti~lyingthe best fit likelihood by the Occam factor. To evaluate the Occam factor all we need is the Hessian A, if the gaussian approximation is good. Thus the Bayesian method of model comparison by evaluation of the evidence is computationally no more demanding than the task of finding for each model the best fit parameters and their error bars.
David J. C. MacKay
424 0
0
0
It is common for there to be degeneracies in models with many parameters; that is, several equivalent parameters could be relabeled without affecting the likelihood. In these cases, the right-hand side of equation 2.6 should be multiplied by the degeneracy of WMP to give the correct estimate of the evidence. ”Minimum description length” (MDL) methods are closely related to this Bayesian framework (Rissanen 1978; Wallace and Boulton 1968; Wallace and Freeman 1987). The log evidence log,P(D I 31,) is the number of bits in the ideal shortest message that encodes the data D using model 31,. Akaike’s (1970) criterion can be viewed as an approximation to MDL (Schwarz 1978; Zellner 1984). Any implementation of MDL necessitates approximations in evaluating the length of the ideal shortest message. Although some of the earliest work on complex model comparison involved the MDL framework (Patrick and Wallace 19821, I can see no advantage in MDL, and recommend that the evidence should be approximated directly. It should be emphasized that the Occam factor has nothing to do with how computationally complex it is to use a model. The evidence is a measure of plausibility of a model. How much CPU time it takes to use each model is certainly an interesting issue that might bias our decisions toward simpler models, but Bayes’ rule does not address that issue. Choosing between models on the basis of how many function calls they need is an exercise in decision theory, which is not addressed in this paper. Once the probabilities described above have been inferred, optimal actions can be chosen using standard decision theory with a suitable utility function.
3 The Noisy Interpolation Problem
Bayesian interpolation through noise-fvee data has been studied by Sibisi (1991). In this paper I study the problem of interpolating through data where the dependent variables are assumed to be noisy (a task also known as ”regression,” ”curve-fitting,” ”signal estimation,” or, in the neural networks community, “learning”). I am not examining the case where the independent variables are also noisy. This different and more difficult problem has been studied for the case of straight line-fitting by Gull (1989b). Let us assume that the data set to be interpolated is a set of pairs D = {x,,,, t,}, where m = 1 . . .N is a label running over the pairs. For simplicity I will treat x and t as scalars, but the method generalizes to the multidimensional case. To define a linear interpolation model, a set
Bayesian Interpolation
425
of k fixed basis functions3 d = (q$,(x)} is chosen, and the interpolated function is assumed to have the form: (3.1)
where the parameters wh are to be inferred from the data. The data set is modeled as deviating from this mapping under some additive noise process: tm = y ( x m )
+vm
(3.2)
If v is modeled as zero-mean gaussian noise with standard deviation g,,, then the probability of the data4 given the parameters w is
where /3 = l/a2,, ED = C,,!fk(xm) - f,]’, and ZD = ( 2 ~ / / 3 ) ~ / ~ . P(D 1 w, p, A) is called the likelihood. It is well known that finding the maximum likelihood parameters wML may be an “ill-posed” problem. That is, the w that minimizes ED is underdetermined and/or depends sensitively on the details of the noise in the data; the maximum likelihood interpolant in such cases oscillates wildly so as to fit the noise. Thus it is clear that to complete an interpolation model we need a prior R that expresses the sort of smoothness we expect the interpolant y(x) to have. A model may have a prior of the form (3.4) where Ey might be for example the functional E, = Jy”(x)’dx (which is the regularizer for cubic spline interpolation5). The parameter N is a measure of how smooth f(x) is expected to be. Such a prior can also be written as a prior on the parameters w: (3.5) where Zw = Jdkwexp(-aEw). Ew (or EY)is commonly referred to as a regularizing function. The interpolation model is now complete, consisting of a choice of basis functions d,a noise model with parameter P, and a prior (regularizer) ‘R, with regularizing constant a. 3The case of adaptive basis functions, also known as feedforward neural networks, is examined in a companion paper. 4Strictly,this probability should be written P ( { t m } I { x m } ,w,P, A), since these interpolation models do not predict the distribution of input variables { x m } ;this liberty of notation will be taken throughout this paper and its companion. 5Strictly,this particular prior may be improper because a y(x) of the form w ~ + x wo is not constrained by this prior.
David J. C. MacKay
426
3.1 The First Level of Inference. If CY and /3 are known, then the posterior probability of the parameters w is6 (3.6)
Writing'
M(w) = aEw
+ ,BED
(3.7)
the posterior is
(3.8) where ZM(CY, P ) = Jdkw exp(-M). We see that minimizing the combined objective function M corresponds to finding the most probable interpolant, W M ~ . Error bars on the best fit interpolant* can be obtained from the Hessian of M, A = VVM, evaluated at WMP. This is the well known Bayesian view of regularization (Poggio et al. 1985; Titterington 1985). Bayes can do a lot more than just provide an interpretation for regularization. What we have described so far is just the first of three levels of inference. (The second level described in sections 1 and 2, "model comparison," splits into a second and a third level for this problem, because each interpolation model is made up of a continuum of submodels with different values of CY and P.) At the second level, Bayes allows us to objectively assign values to CY and P, which are commonly unknown a priori. At the third, Bayes enables us to quantitatively rank alternative basis sets A, alternative regularizers (priors) R, and, in principle, alternative noise models. Furthermore, we can quantitatively compare interpolation under any model A, R with other interpolation and learning models such as neural networks, if a similar Bayesian approach is applied to them. Neither the second nor the third level of inference can be successfully executed without Occam's razor. The Bayesian theory of the second and third levels of inference has only recently been worked out (Gull 1989a); this paper's goal is to review that framework. Section 4 will describe the Bayesian method of inferring CY and p; Section 5 will describe Bayesian model comparison for the interpolation problem. Both these inference problems are solved by evaluation of the appropriate evidence. 'The regularizer a , R has been omitted from the conditioning variables in the likelihood because the data distribution does not depend on the prior once w is known. Similarly the prior does not depend on p. 7The name M stands for "misfit";it will be demonstrated later that M is the natural measure of misfit, rather than & = ZPED. 'These error bars represent the uncertainty of the interpolant, and should not be confused with the typical scatter of noisy data points relative to the interpolant.
Bayesian Interpolation
427
4 Selection of Parameters a and /3
Typically, a is not known a priori, and often ,b is also unknown. As a is varied, the properties of the best fit (most probable) interpolant vary. Assume that we are using a prior that encourages smoothness, and imagine that we interpolate at a very large value of a; then this will constrain the interpolant to be very smooth and flat, and it will not fit the data at all well (Fig. 4a). As a is decreased, the interpolant starts to fit the data better (Fig.4b). If a is made even smaller, the interpolant oscillates wildly so as to overfit the noise in the data (Fig. 4c). The choice of the "best" value of (Y is our first "Occam's razor" problem: large values of a correspond to simple models that make constrained and precise predictions, saying "the interpolant is expected to not have extreme curvature anywhere"; a tiny value of a corresponds to the more powerful and flexible model that says "the interpolant could be anything at all, our prior belief in smoothness is very weak." The task is to find a value of a that is small enough that the data are fitted but not so small that they are overfitted. For more severely ill-posed problems such as deconvolution, the precise value of the regularizing parameter is increasingly important. Orthodox statistics has ways of assigning values to such parameters, based for example on misfit criteria, the use of test data, and cross-validation. Gull (1989a)has demonstrated why the popular use of misfit criteria is incorrect and how Bayes sets these parameters. The use of test data may be an unreliable technique unless large quantities of data are available. Cross-validation, the orthodox "method of choice" (Eubank 19881, will be discussed more in Section 6.6 and MacKay (1992a). I will explain the Bayesian method of inferring a and ,b after first reviewing some statistics of misfit.
4.1 Misfit, x2,and the Effect of Parameter Measurements. For N independent gaussian variables with mean p and standard deviation D, the statistic x2 = C ( X - ~ ) is ~ a/ measure O ~ of misfit. If p is known a priori, x2 has expectation Nf However, if p is fitted from the data by setting p = X, we "use up a degree of freedom," and x2 has expectation N- 1. In the second case p is a "well-measured parameter." When a parameter is determined by the data in this way it is unavoidable that the parameter fits some of the noise in the data as well. That is why the expectation of x2 is reduced by one. This is the basis of the distinction between the ON and ON-^ buttons on your calculator. It is common for this distinction to be ignored, but in cases such as interpolation where the number of free parameters is similar to the number of data points, it is essential to find and make the analogous distinction. It will be demonstrated that the Bayesian choices of both Q and ,b are most simply expressed in terms of the effective number of well-measured parameters, 7 , to be derived below.
a.
David J. C. MacKay
428
3
1.3
.
3
.
,
I
,
,
.
. *..
.
I"t._L."z
. 0.-
.
-
b) Figure 4: How the best interpolant depends on a. 'These figures introduce a data set, "X," that is interpolated with a variety of models in this paper. Notice that the density of data points is not uniform on the x-axis. In the three figures the data set is interpolated using a radial basis function model with a basis of 60 equally spaced Cauchy functions, all with radius 0.2975. The regularizer is Ew = C w2,where w are the coefficients of the basis functions. Each figure shows the most probable interpolant for a different value of a: (a) 6000; (b) 2.5; (c) Note at the extreme values how the data are oversmoothed and overfitted, respectively. Assuming a flat prior, N = 2.5 is the most probable value of a. In (b), the most probable interpolant is displayed with its la error bars, which represent how uncertain we are about the interpolant at each point, under the assumption that the interpolation model and the value of N are correct. Notice how the error bars increase in magnitude where the data are sparse. The error bars do not get bigger near the datapoint close to (l,O), because the radial basis function model does not expect sharp discontinuities; the error bars are obtained assuming the model is correct, so that point is interpreted as an improbable outlier.
Misfit criteria are "principles" that set parameters like a and (3 by requiring that x2 should have a particular value. The discrepancy principle requires x2 = N. Another principle requires x2 = N - k, where k is the number of free parameters. We will find that an intuitive misfit criterion
Bayesian Interpolation
429
arises for the most probable value of /j; on the other hand, the Bayesian choice of cy will be unrelated to the value of the misfit. 4.2 Bayesian Choice of N and 0. To infer from the data what value and B should have: Bayesians evaluate the posterior probability distribution:
Q
The data-dependent term P ( D I a,P , A, R) has already appeared earlier as the normalizing constant in equation 3.6, and it is called the evidence for cy and 8. Similarly the normalizing constant of equation 4.1 is called the evidence for A, R, and it will turn up later when we compare alternative models A, R in the light of the data. If P(a,/?)is a flat prior" (which corresponds to the statement that we do not know what value cy and /? should have), the evidence is the function that we use to assign a preference to alternative values of Q and p. It is given in terms of the normalizing constants defined earlier by
Occam's razor is implicit in this formula: if a is small, the large freedom in the prior range of possible values of w is automatically penalized by the consequent large value of ZW; models that fit the data well achieve a large value of ZM. The optimum value of Q achieves a compromise between fitting the data well and being a simple model. Now to assign a preference to ( a , / ? ) our , computational task is to evaluate the three integrals ZM, ZW, and ZD. We will come back to this task in a moment. 4.2.1 But That Sounds Like Determining Your Prior after the Data Haue Arrived! This is an aside that can be omitted on a first reading. When I first heard the preceding explanation of Bayesian regularization I was discontent because it seemed that the prior is being chosen from an ensemble of possible priors after the data have arrived. To be precise, as described 'Note that it is not satisfactory to simply maximize the likelihood simultaneously over w, a, and 0;the likelihood has a skew peak such that the maximum likelihood value for the parameters is not in the same place as most of the posterior probability (Gull 1989a). To get a feeling for this here is a more familiar problem: examine the posterior probability for the parameters of a gaussian ( p ,a ) given N samples: the maximum likelihood value for u is UN, but the most probable value for u (found by integrating over p ) is U N - ~ . It should be emphasized that this distinction has nothing to do with the prior over the parameters, which is flat here. It is the process of marginalization that corrects the bias of maximum likelihood. '"Since (Y and fi are scale parameters, this prior should be understood as a flat prior over loga and log@.
David J. C. MacKay
430
above, the most probable value of a is selected; then the prior corresponding to that value of a alone is used to infer what the interpolant might be. This is not how Bayes would have us infer the interpolant. It is the combined ensemble of priors that define our prior, and we should integrate over this ensemble when we do inference.” Let us work out what happens if we follow this proper approach. The preceding method of using only the most probable prior will emerge as a good approximation. The true posterior P(w I D , d , R ) is obtained by integrating over a and P:
P(w I D, A, R)= /P(w
I D , a, P, d,R)P(a,P I D, A, R)d a dP
(4.3)
In words, the posterior probability over w can be written as a linear combination of the posteriors for all values of a,P. Each posterior density is weighted by the probability of cr, P given the data, which appeared in equation 4.1. This means that if P ( a , P I D , d , R ) has a dominant peak at &,b,then the true posterior P(w I D , d , R )will be dominated by the density P(w I D , b , p , d , R ) . As long as the properties of the posterior P(w I D,cr, p, A, R ) do not change rapidly with a , j5’ near b, b and the peak in P(a,P I D , A, R) is strong, we are justified in using the approximation: P(w I D , d , R ) -P(w
I D,&,b,d,R)
(4.4)
This approximation is valid under the same conditions as in footnote 12. It is a matter of ongoing research to develop computational methods for cases where this approximation is invalid (Sibisi and Skilling, personal communication). 4.3 Evaluating the Evidence. Let us return to our train of thought at equation 4.2. To evaluate the evidence for a,P, we want to find the integrals ZM,ZW,and ZD. Typically the most difficult integral to evaluate is ZM. z M ( a , P ) = dkW exp[-M(wi ( 2 , P)]
1
If the regularizer R is a quadratic functional (and the favorites are), then E_D and E_W are quadratic functions of w, and we can evaluate Z_M exactly. Letting ∇∇E_W = C and ∇∇E_D = B, then using A = αC + βB, we have

M = M(w_MP) + ½(w − w_MP)ᵀ A (w − w_MP)

where w_MP = βA⁻¹Bw_ML. This means that Z_M is the gaussian integral:

Z_M = e^(−M(w_MP)) (2π)^(k/2) det^(−1/2) A    (4.5)
¹¹It is remarkable that Laplace almost got this right in 1774 (Stigler 1986); when inferring the mean of a Laplacian distribution, he both inferred the posterior probability of a nuisance parameter as in equation 4.1, and then attempted to integrate out the nuisance parameter as in equation 4.3.
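For a small problem, the gaussian integral of equation 4.5 can be checked directly. The sketch below (a toy of my own construction, with arbitrary matrices; not code from the paper) builds a two-parameter quadratic M(w) = αE_W + βE_D, evaluates Z_M by equation 4.5, and compares the result with a brute-force sum over a grid:

```python
import numpy as np

# Toy quadratic forms: E_W = 0.5 w^T C w, E_D = 0.5 (w - w_ML)^T B (w - w_ML)
C = np.eye(2)
B = np.array([[4.0, 1.0], [1.0, 2.0]])
w_ml = np.array([1.0, -0.5])
alpha, beta, k = 0.7, 3.0, 2

A = alpha*C + beta*B                          # Hessian of M
w_mp = beta * np.linalg.solve(A, B @ w_ml)    # w_MP = beta A^{-1} B w_ML

def M(W):
    d = W - w_ml
    EW = 0.5 * np.einsum('...i,ij,...j', W, C, W)
    ED = 0.5 * np.einsum('...i,ij,...j', d, B, d)
    return alpha*EW + beta*ED

# Equation 4.5: Z_M = exp(-M(w_MP)) (2 pi)^{k/2} det^{-1/2} A
ZM_gauss = np.exp(-M(w_mp)) * (2*np.pi)**(k/2) / np.sqrt(np.linalg.det(A))

# Brute-force integration on a grid
u = np.linspace(-6, 6, 601)
du = u[1] - u[0]
W = np.stack(np.meshgrid(u, u), axis=-1)      # grid of w vectors
ZM_grid = np.exp(-M(W)).sum() * du**2

print(ZM_gauss, ZM_grid)                      # the two values agree
```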
Figure 5: Choosing α. (a) The evidence as a function of α: Using the same radial basis function model as in Figure 4, this graph shows the log evidence as a function of α, and shows the functions that make up the log evidence, namely the data misfit χ²_D = 2βE_D, the weight penalty term χ²_W = 2αE_W, and the log of the Occam factor (2π)^(k/2) det^(−1/2)A / Z_W(α). (b) Criteria for optimizing α: This graph shows the log evidence as a function of α, and the functions whose intersection locates the evidence maximum: the number of good parameter measurements γ, and χ²_W. Also shown is the test error (rescaled) on two test sets; finding the test error minimum is an alternative criterion for setting α. Both test sets were more than twice as large in size as the interpolated data set. Note how the point at which χ²_W = γ is clear and unambiguous, which cannot be said for the minima of the test energies. The evidence gives α a 1σ confidence interval of [1.3, 5.0]. The test error minima are more widely distributed because of finite sample noise.
In many cases where the regularizer is not quadratic (for example, entropy-based), this gaussian approximation is still serviceable (Gull 1989a). Thus we can write the log evidence for α and β as

log P(D | α, β, A, R) = −αE_W^MP − βE_D^MP − ½ log det A − log Z_W(α) − log Z_D(β) + (k/2) log 2π    (4.6)

The term βE_D^MP represents the misfit of the interpolant to the data. The three terms −αE_W^MP − ½ log det A − log Z_W(α) constitute the log of the "Occam factor" penalizing over-powerful values of α: the ratio (2π)^(k/2) det^(−1/2)A / Z_W(α) is the ratio of the posterior accessible volume in parameter space to the prior accessible volume, and the term αE_W^MP measures how far w_MP is from its null value. Figure 5a illustrates the behavior
of these various terms as a function of α for the same radial basis function model as illustrated in Figure 4. Now we could just proceed to evaluate the evidence numerically as a function of α and β, but a deeper and more fruitful understanding of this problem is possible.

4.4 Properties of the Evidence Maximum. The maximum over α, β of P(D | α, β, A, R) = Z_M(α, β)/[Z_W(α)Z_D(β)] has some remarkable properties that give deeper insight into this Bayesian approach. The results of this section are useful both numerically and intuitively. Following Gull (1989a), we transform to the basis in which the Hessian of E_W is the identity, ∇∇E_W = I. This transformation is simple in the case of quadratic E_W: rotate into the eigenvector basis of C and stretch the axes so that the quadratic form E_W becomes homogeneous. This is the natural basis for the prior. I will continue to refer to the parameter vector in this basis as w, so from here on E_W = ½ Σ_i w_i². Using ∇∇M = A and ∇∇E_D = B as above, we differentiate the log evidence with respect to α and β so as to find the condition that is satisfied at the maximum. The log evidence, from equation 4.6, is
log P(D | α, β, A, R) = −αE_W^MP − βE_D^MP − ½ log det A + (k/2) log α + (N/2) log β − (N/2) log 2π    (4.7)
First, differentiating with respect to α, we need to evaluate d/dα log det A. Using A = αI + βB,

d/dα log det A = Trace(A⁻¹ dA/dα) = Trace(A⁻¹ I) = Trace A⁻¹

This result is exact if E_W and E_D are quadratic. Otherwise this result is an approximation, omitting terms in dB/dα. Now, differentiating equation 4.7 and setting the derivative to zero, we obtain the following condition for the most probable value of α:

2αE_W^MP = k − α Trace A⁻¹    (4.8)
The quantity on the left is the dimensionless measure of the amount of structure introduced into the parameters by the data, that is, how much the fitted parameters differ from their null value. It can be interpreted as the χ² of the parameters, since it is equal to χ²_W = Σ w_i²/σ_W², with σ_W² = 1/α. The quantity on the right of equation 4.8 is called the number of good parameter measurements, γ, and has value between 0 and k. It can be written in terms of the eigenvalues λ_a of βB, where the subscript a runs over the k eigenvectors. The eigenvalues of A are λ_a + α, so we have

γ = k − α Trace A⁻¹ = Σ_{a=1}^{k} λ_a/(λ_a + α)    (4.9)

Figure 6: Good and bad parameter measurements. Let w₁ and w₂ be the components in parameter space in two directions parallel to eigenvectors of the data matrix B. The circle represents the characteristic prior distribution for w. The ellipse represents a characteristic contour of the likelihood, centered on the maximum likelihood solution w_ML. w_MP represents the most probable parameter vector. w₁ is a direction in which λ₁ is small compared to α, that is, the data have no strong preference about the value of w₁; w₁ is a poorly measured parameter, and the term λ₁/(λ₁ + α) is close to zero. w₂ is a direction in which λ₂ is large; w₂ is well determined by the data, and the term λ₂/(λ₂ + α) is close to one.
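Equation 4.9 is easy to verify numerically. The sketch below (toy values of my own choosing) computes the two expressions for γ, k − α Trace A⁻¹ and Σ_a λ_a/(λ_a + α), for a random positive definite βB, working in the basis where ∇∇E_W = I:

```python
import numpy as np

rng = np.random.default_rng(1)
k, alpha = 5, 0.5
R = rng.normal(size=(k, k))
betaB = R @ R.T                        # a positive definite "beta B"
A = alpha*np.eye(k) + betaB            # in the whitened basis, A = alpha I + beta B

lam = np.linalg.eigvalsh(betaB)        # eigenvalues lambda_a of beta B
gamma_trace = k - alpha*np.trace(np.linalg.inv(A))
gamma_eig = np.sum(lam / (lam + alpha))
print(gamma_trace, gamma_eig)          # identical: eigenvalues of A are lambda_a + alpha
```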
Each eigenvalue λ_a measures how strongly one parameter is determined by the data. The constant α measures how strongly the parameters are determined by the prior. The ath term γ_a = λ_a/(λ_a + α) is a number between 0 and 1 that measures the strength of the data in direction a relative to the prior (Fig. 6): the components of w_MP are given by w_MP,a = γ_a w_ML,a.
A direction in parameter space for which λ_a is small compared to α does not contribute to the number of good parameter measurements. γ is thus a measure of the effective number of parameters that are well determined by the data. As α/β → 0, γ increases from 0 to k. The condition, equation 4.8, for the most probable value of α can therefore be interpreted as an estimation of the variance of the gaussian distribution from which the weights are drawn, based on γ effective samples from that distribution: σ_W² = Σ w_i²/γ.

This concept is not only important for locating the optimum value of α: it is only the γ good parameter measurements that are expected to contribute to the reduction of the data misfit that occurs when a model is fitted to noisy data. In the process of fitting w to the data, it is unavoidable that some fitting of the model to noise will occur, because some
components of the noise are indistinguishable from real data. Typically, one unit (χ²) of noise will be fitted for every well-determined parameter. Poorly determined parameters are determined by the regularizer only, so they do not reduce χ²_D in this way.

We will now examine how this concept enters into the Bayesian choice of β. Recall that the expectation of the χ² misfit between the true interpolant and the data is N. However we do not know the true interpolant, and the only misfit measure to which we have access is the χ² between the inferred interpolant and the data, χ²_D = 2βE_D. The "discrepancy principle" of orthodox statistics states that the model parameters should be adjusted so as to make χ²_D = N. Work on unregularized least-squares regression suggests that we should estimate the noise level so as to set χ²_D = N − k, where k is the number of free parameters. Let us find out the opinion of Bayes' rule on this matter. We differentiate the log evidence, equation 4.7, with respect to β and obtain, setting the derivative to zero:

2βE_D = N − γ    (4.10)

Thus the most probable noise estimate, β̂, does not satisfy χ²_D = N or χ²_D = N − k; rather, χ²_D = N − γ. This Bayesian estimate of noise level naturally takes into account the fact that the parameters that have been determined by the data inevitably suppress some of the noise in the data, while the poorly measured parameters do not. Note that the value of χ²_D enters only into the determination of β: misfit criteria have no role in the Bayesian choice of α (Gull 1989a). In summary, at the optimum value of α and β, χ²_W = γ and χ²_D = N − γ. Notice that this implies that the total misfit M = αE_W + βE_D satisfies the simple equation 2M = N.

The interpolant resulting from the Bayesian choice of α is illustrated by Figure 4b. Figure 5b illustrates the functions involved with the Bayesian choice of α, and compares them with the "test error" approach. Demonstration of the Bayesian choice of β is omitted, since it is straightforward; β is fixed to its true value for the demonstrations in this paper. Inference of an input-dependent noise level β(x) will be demonstrated in a future publication.

These results generalize to the case where there are two or more separate regularizers with independent regularizing constants {α_c} (Gull 1989a). In this case, each regularizer has a number of good parameter measurements γ_c associated with it. Multiple regularizers will be used in the companion paper on neural networks.

Finding the evidence maximum with a head-on approach would involve evaluating det A while searching over α, β; the above results (equations 4.8, 4.10) enable us to speed up this search (for example, by the use of reestimation formulas like α := γ/2E_W) and replace the evaluation of det A by the evaluation of Trace A⁻¹. For large dimensional problems where this task is demanding, Skilling (1989) has developed methods for estimating Trace A⁻¹ statistically in k² time.
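The reestimation formulas lend themselves to a simple fixed-point iteration. The sketch below (a toy linear regression of my own construction, not the paper's implementation) alternates between solving for w_MP at the current (α, β) and applying α := γ/2E_W and β := (N − γ)/2E_D; at convergence χ²_W = γ and χ²_D = N − γ, as derived above:

```python
import numpy as np

rng = np.random.default_rng(2)
N, k = 40, 8
Phi = rng.normal(size=(N, k))                # design matrix of basis functions
t = Phi @ rng.normal(size=k) + 0.3*rng.normal(size=N)   # noisy targets

alpha, beta = 1.0, 1.0
for _ in range(200):
    A = alpha*np.eye(k) + beta*(Phi.T @ Phi)        # Hessian of alpha E_W + beta E_D
    w_mp = beta * np.linalg.solve(A, Phi.T @ t)
    EW = 0.5 * w_mp @ w_mp
    ED = 0.5 * np.sum((t - Phi @ w_mp)**2)
    lam = np.linalg.eigvalsh(beta*(Phi.T @ Phi))
    gamma = np.sum(lam / (lam + alpha))             # good parameter measurements
    alpha = gamma / (2*EW)                          # from 2 alpha E_W = gamma  (4.8)
    beta = (N - gamma) / (2*ED)                     # from 2 beta E_D = N - gamma (4.10)

print("gamma =", gamma, " chi2_W =", 2*alpha*EW, " chi2_D =", 2*beta*ED)
print("estimated noise level 1/sqrt(beta) =", 1/np.sqrt(beta))  # compare with 0.3
```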
5 Model Comparison
To rank alternative basis sets A and regularizers (priors) R in the light of the data, we examine the posterior probabilities:
P(A, R | D) ∝ P(D | A, R) P(A, R)    (5.1)
The data-dependent term, the evidence for A, R, appeared earlier as the normalizing constant in equation 4.1, and is evaluated by integrating the evidence for (α, β):
P(D | A, R) = ∫ P(D | A, R, α, β) P(α, β) dα dβ    (5.2)
Assuming that we have no reason to assign strongly differing priors P(A, R), alternative models A, R are ranked just by examining the evidence. The evidence can also be compared with the evidence found by an equivalent Bayesian analysis of other learning and interpolation models so as to allow the data to assign a preference to the alternative models. Notice, as pointed out earlier, that this modern Bayesian framework includes no emphasis on defining the "right" prior R with which we ought to interpolate. Rather, we invent as many priors (regularizers) as we want, and allow the data to tell us which prior is most probable. Having said this, I would still recommend that the "maximum entropy principle" and other respected guides should be consulted when inventing these priors (see Gull 1988, for example).

5.1 Evaluating the Evidence for A, R. As α and β vary, a single evidence maximum is obtained, at α̂, β̂ (at least for quadratic E_D and E_W). The evidence maximum is often well approximated¹² by a separable gaussian, and differentiating equation 4.7 twice we obtain gaussian error bars for log α and log β:
(Δ log α)² ≃ 2/γ,    (Δ log β)² ≃ 2/(N − γ)
Putting these error bars into equation 5.2, we obtain the evidence:¹³

P(D | A, R) ≃ P(D | α̂, β̂, A, R) P(α̂, β̂) 2π Δlog α Δlog β    (5.3)
¹²This approximation is valid when, in the spectrum of eigenvalues of βB, the number of eigenvalues within e-fold of α̂ is O(1). ¹³There are analytic methods for performing such integrals over β (Bretthorst 1990).
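Putting the pieces together, the evidence of equation 5.3 can be computed from the (α̂, β̂) optimum found by the iteration sketched in Section 4. The following helper functions are my own illustrative names, not the paper's code; a flat prior over log α and log β is assumed, so the P(α̂, β̂) factor is a constant that cancels between models and is omitted:

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Equation 4.7 for a linear model with E_W = 0.5 w.w and gaussian noise."""
    N, k = Phi.shape
    A = alpha*np.eye(k) + beta*(Phi.T @ Phi)
    w_mp = beta * np.linalg.solve(A, Phi.T @ t)
    EW = 0.5 * w_mp @ w_mp
    ED = 0.5 * np.sum((t - Phi @ w_mp)**2)
    return (-alpha*EW - beta*ED - 0.5*np.linalg.slogdet(A)[1]
            + 0.5*k*np.log(alpha) + 0.5*N*np.log(beta) - 0.5*N*np.log(2*np.pi))

def log_evidence_model(Phi, t, alpha_hat, beta_hat, gamma):
    """Equation 5.3: gaussian integration over log alpha and log beta."""
    N = len(t)
    dla = np.sqrt(2.0/gamma)             # (Delta log alpha)^2 = 2/gamma
    dlb = np.sqrt(2.0/(N - gamma))       # (Delta log beta)^2 = 2/(N - gamma)
    return log_evidence(Phi, t, alpha_hat, beta_hat) + np.log(2*np.pi*dla*dlb)
```

Differences of log_evidence_model between alternative models then translate directly into posterior probability ratios.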
How is the prior P(α̂, β̂) assigned? This is the first time in this paper that we have met one of the infamous "subjective priors" that are supposed to plague Bayesian methods. Here are some answers to this question. (1) Any other method of assigning a preference to alternatives must implicitly assign such priors. Bayesians adopt the healthy attitude of not sweeping them under the carpet. (2) With some thought, reasonable values can usually be assigned to subjective priors, and the degree of reasonable subjectivity in these assignments can be quantified. For example, a reasonable prior on an unknown standard deviation states that σ is unknown over a range of (3 ± 2) orders of magnitude. This prior contributes a subjectivity of about ±1 to the value of the log evidence. This degree of subjectivity is often negligible compared to the log evidence differences. (3) In the noisy interpolation example, all models considered include the free parameters α and β. So in this paper I do not need to assign a value to P(α̂, β̂); I assume that it is a flat prior (flat over log α and log β, since α and β are scale parameters) that cancels out when we compare alternative interpolation models.
6 Demonstration
These demonstrations will use two one-dimensional data sets, in imitation of Sibisi (1991). The first data set, "X," has discontinuities in derivative (Fig. 4), and the second is a smoother data set, "Y" (Fig. 8). In all the demonstrations, β was not left as a free parameter, but was fixed to its known true value.

The Bayesian method of setting α, assuming a single model is correct, has already been demonstrated, and quantified error bars have been placed on the most probable interpolant (Fig. 4). The method of evaluating the error bars is to use the posterior covariance matrix of the parameters w_h, A⁻¹, to get the variance on y(x), which for any x is a linear function of the parameters, y(x) = Σ_h φ_h(x) w_h. The error bars at a single point x are given by var y(x) = φᵀA⁻¹φ. Actually we have access to the full covariance information for the entire interpolant, not just the pointwise error bars. It is possible to visualize the joint error bars on the interpolant by making typical samples from the posterior distribution, performing a random walk around the posterior "bubble" in parameter space (Sibisi 1991; Skilling et al. 1991). Figure 8 shows data set Y interpolated by three typical interpolants found by random sampling from the posterior distribution. These error bar properties are found under the assumption that the model is correct; so it is possible for the true interpolant to lie significantly outside the error bars of a poor model.

In this section Bayesian model comparison will be demonstrated first with models differing only in the number of free parameters (for example, polynomials of different degrees), then with comparisons between models as disparate as splines, radial basis functions, and feedforward
neural networks. For each individual model, the value of α is optimized, and the evidence is evaluated by integrating over α using the gaussian approximation. All logarithms are to base e.

Figure 7: The evidence for data set X (see also Table 1). (a) Log evidence for Legendre polynomials. Notice the evidence maximum. The gentle slope to the right is due to the "Occam factors" that penalize the increasing complexity of the model. (b) Log evidence for radial basis function models. Notice that there is no Occam penalty for the additional coefficients in these models, because increased density of radial basis functions does not make the model more powerful. The oscillations in the evidence are due to the details of the pixellation of the basis functions relative to the data points. (c) Log evidence for splines. The evidence is shown for the alternative splines regularizers p = 0 ... 6 (see text). In the representation used, each spline model is obtained in the limit of an infinite number of coefficients. For example, p = 4 yields the cubic splines model. (d) Test error for splines. The number of data points in the test set was 90, cf. number of data points in training set = 37. The y-axis shows E_D; the value of E_D for the true interpolant has expectation 0.225 ± 0.02.
6.1 Legendre Polynomials: Occam's Razor for the Number of Basis Functions. Figure 7a shows the evidence for Legendre polynomials of different degrees for data set X. The basis functions were chosen to be orthonormal on an interval enclosing the data, and a regularizer of the form E_W = ½ Σ_i w_i² was used.
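The Occam hill of Figure 7a is straightforward to reproduce on synthetic data. The sketch below (my own illustration; the data set, grid of α values, and use of ordinary rather than orthonormalized Legendre polynomials are all arbitrary choices) reuses the log_evidence routine sketched in Section 5.1, maximizing over α at fixed β for increasing numbers of basis functions:

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 37)
t = np.sign(np.sin(4*x)) + 0.1*rng.normal(size=len(x))  # a spiky toy data set
beta = 1/0.1**2                                         # noise level known

alphas = np.exp(np.linspace(-8, 8, 81))
for k in range(2, 31, 4):
    Phi = legendre.legvander(x, k - 1)                  # first k Legendre polynomials
    ev = max(log_evidence(Phi, t, a, beta) for a in alphas)
    print(k, round(ev, 1))   # log evidence rises to a maximum, then falls gently
```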
Table 1: Evidence for models interpolating data sets X and Y.ᵃ

                                  Data set X                        Data set Y
Model                             Best parameters     Log evidence  Best parameters     Log evidence
Legendre polynomials              k = 38              -47           k = 11              23.8
Gaussian radial basis functions   k > 40, r = .25     -28.8 ± 1.0   k > 50, r = .77     27.1 ± 1.0
Cauchy radial basis functions     k > 50, r = .27     -18.9 ± 1.0   k > 50, r = 1.1     25.7 ± 1.0
Splines, p = 2                    k > 80              -9.5          k > 50              8.2
Splines, p = 3                    k > 80              -5.6          k > 50              19.8
Splines, p = 4                    k > 80              -13.2         k > 50              22.1
Splines, p = 5                    k > 80              -24.9         k > 50              21.8
Splines, p = 6                    k > 80              -35.8         k > 50              20.4
Hermite functions                 k = 18              -66           k = 3               42.2
Neural networks                   8 neurons, k = 25   -12.6         6 neurons, k = 19   25.7

ᵃAll logs are natural. The evidence P(D | H) is a density over D space, so the absolute value of the log evidence is arbitrary within an additive constant. Only differences in values of log evidences are relevant, relating directly to probability ratios.
Notice that an evidence maximum is obtained: beyond a certain number of terms, the evidence starts to decrease. This is the Bayesian Occam's razor at work. The additional terms make the model more powerful, able to make more predictions. This power is automatically penalized. Notice the characteristic shape of the "Occam hill." On the left, the hill is steep as the oversimple models fail to fit the data; the penalty for misfitting the data scales as N, the number of data measurements. The other side of the hill is much less steep; the log Occam factors here only scale as k log N, where k is the number of parameters. We note in Table 1 the value of the maximum evidence achieved by these models, and move on to alternative models.

The choice of orthonormal Legendre polynomials described above was motivated by a maximum entropy argument (Gull 1988). Models using other polynomial basis sets have also been tried. For less well-motivated basis sets such as Hermite polynomials, it was found that the Occam factors were far bigger and the evidence was substantially smaller. If the size of the Occam factor increases rapidly with overparameterization, it is generally a sign that the space of alternative models is poorly matched to the problem.
6.2 Fixed Radial Basis Functions. For a radial basis function or "kernel" model, the basis functions are φ_h(x) = g[(x − x_h)/r]/r; here the x_h are equally spaced over the range of interest. I examine two choices of g: a gaussian and a Cauchy function, 1/(1 + x²). We can quantitatively compare these alternative models of spatial correlation for any data set by evaluating the evidence. The regularizer is E_W = ½ Σ_i w_i². Note that this model includes one new free parameter, r; in these demonstrations this parameter has been set to its most probable value (i.e., the value that maximizes the evidence). To penalize this free parameter an Occam factor is included, √(2π) P(log r) Δlog r, where Δlog r is the posterior uncertainty in log r, and P(log r) is the prior on log r, which is subjective to a small degree [I used P(log r) = 1/(4 ± 2)]. This radial basis function model is the same as the "intrinsic correlation" model of Charter (1991), Gull (1989a), and Sibisi (1991).

Figure 7b shows the evidence as a function of the number of basis functions, k. Note that for these models there is not an increasing Occam penalty for large numbers of parameters. The reason for this is that these extra parameters do not make the model any more powerful (for fixed α and r). The increased density of basis functions does not enable the model to make any significant new predictions because the kernel g band-limits the possible interpolants.
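The design matrix for these kernel models is easy to write down; the sketch below (an illustration with arbitrarily chosen grid and widths) constructs the basis φ_h(x) = g[(x − x_h)/r]/r for both kernels, after which the evidence machinery sketched earlier applies unchanged:

```python
import numpy as np

def rbf_design(x, centers, r, kernel="gaussian"):
    """Basis functions phi_h(x) = g((x - x_h)/r)/r on fixed, equally spaced centers."""
    u = (x[:, None] - centers[None, :]) / r
    if kernel == "gaussian":
        g = np.exp(-0.5*u**2)
    else:                                  # Cauchy kernel, 1/(1 + x^2)
        g = 1.0/(1.0 + u**2)
    return g / r

x = np.linspace(-1, 1, 37)
centers = np.linspace(-1.2, 1.2, 50)       # k = 50 equally spaced centers
Phi_gaussian = rbf_design(x, centers, r=0.25)
Phi_cauchy = rbf_design(x, centers, r=0.27, kernel="cauchy")
```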
6.3 Splines: Occam's Razor for the Choice of Regularizer. The splines model was implemented as follows: let the basis functions be a Fourier set cos hx, sin hx, h = 0, 1, 2, .... Use the regularizer E_W = Σ ½ h^p w²_(h,cos) + Σ ½ h^p w²_(h,sin). If p = 4 then in the limit k → ∞ we have the cubic splines regularizer E_W^(4) = ∫ y''(x)² dx; if p = 2 we have the regularizer E_W^(2) = ∫ y'(x)² dx, etc. Notice that the "nonparametric" splines model can easily be put in an explicit parameterized representation.

Figure 7c shows the evidence for data set X as a function of the number of terms, for p = 0, 1, 2, 3, 4, 6. Notice that in terms of Occam's razor, both cases discussed above occur: for p = 0, 1, as k increases, the model becomes more powerful and there is an Occam penalty. For p = 3, 4, 6, increasing k gives rise to no penalty. The case p = 2 seems to be on the fence between the two. As p increases, the regularizer becomes more opposed to strong curvature. Once we reach p = 6, the model becomes improbable because the data demand sharp discontinuities. The evidence can choose the order of our splines regularizer for us. For this data set, it turns out that p = 3 is the most probable value of p, by a few multiples of e.

In passing, the radial basis function models described above can be transformed into the splines models' Fourier representation. If the radial basis function kernel is g(x), then the regularizer in the splines representation is E_W = Σ ½ w²_(h,cos/sin) G_h⁻², where G_h is the discrete Fourier transform of g.
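In this representation the regularizer is diagonal, so the spline models are again linear-gaussian and the same evidence machinery applies, with the diagonal Hessian ∇∇E_W = diag(h^p) playing the role of C. A minimal sketch (my own; the omission of the constant term and the truncation k are arbitrary choices):

```python
import numpy as np

def spline_model(x, k, p):
    """Fourier basis cos(hx), sin(hx), h = 1..k, with E_W = sum_h h^p w_h^2 / 2."""
    hs = np.arange(1, k + 1)
    Phi = np.concatenate([np.cos(np.outer(x, hs)),
                          np.sin(np.outer(x, hs))], axis=1)
    C = np.diag(np.concatenate([hs, hs]).astype(float)**p)   # Hessian of E_W
    return Phi, C   # use A = alpha*C + beta*Phi.T @ Phi in the evidence formulas
```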
Figure 8: Data set "Y," interpolated with splines, p = 5. The data set is shown with three typical interpolants drawn from the posterior probability distribution. Contrast this with Figure 4b, in which the most probable interpolant is shown with its pointwise error bars.
6.4 Results for a Smoother Data Set. Figure 8 shows data set Y, which comes from a much smoother interpolant than data set X. Table 1 summarizes the evidence for the alternative models. We can confirm that the evidence behaves in a reasonable manner by noting the following differences between data sets X and Y. In the splines family, the most probable value of p has shifted upward to the stiffer splines with p = 4-5, as we would intuitively expect. Legendre polynomials: an observant reader may have noticed that when data set X was modeled with Legendre polynomials, the most probable number of coefficients k = 38 was suspiciously similar to the number of data points N = 37. For data set Y, however, the most probable number of coefficients is 11, which confirms that the evidence does not always prefer the polynomial with k = N! Data set X behaved in this way because it is very poorly modeled by polynomials. The Hermite function model, which was a poor model for data set X, is now the most probable, by a long way (over a million times more probable). The reason for this is that actually the data were generated from a Hermite function!
6.5 Why Bayes Cannot Systematically Reject the Truth. Let us ask a sampling theory question: if one of the models we offer to Bayes is actually true, i.e., it is the model from which the data were generated, then is it possible for Bayes to systematically (over the ensemble of possible data sets) prefer a false model? Clearly under a worst case analysis, a Bayesian's posterior may favor a false model. Furthermore, Skilling (1991) demonstrated that with some data sets a free form (maximum entropy) model can have greater evidence than the truth; but is it possible for this to happen in the typical case, as Skilling seems to claim? I will show that the answer is no: the effect that Skilling demonstrated cannot be systematic. To be precise, the expectation over possible data sets of the log evidence for the true model is greater than the expectation of the log evidence for any other fixed model (Osteyee and Good 1974).¹⁴
Proof. Suppose that the truth is actually H₁. A single data set arrives and we compare the evidences for H₁ and H₂, a different fixed model. Both models may have free parameters, but this will be irrelevant to the argument. Intuitively we expect that the evidence for H₁, P(D | H₁), should usually be greatest. Let us examine the difference in log evidence between H₁ and H₂. The expectation of this difference, given that H₁ is true, is

⟨ log [P(D | H₁)/P(D | H₂)] ⟩ = ∫ P(D | H₁) log [P(D | H₁)/P(D | H₂)] dD

(Note that this integral implicitly integrates over all H₁'s parameters according to their prior distribution under H₁.) Now it is well known that for normalized p and q, ∫ p log p/q is minimized by setting q = p (Gibbs' theorem). Therefore a distinct model H₂ is never expected to systematically defeat the true model, for just the same reason that it is not wise to bet differently from the true odds.

This result has two important implications. First, it gives us frequentist confidence in the ability of Bayesian methods on the average to identify the true model. Second, it provides a stringent test of numerical implementations of Bayesian model comparison: imagine that we have written a program that evaluates the evidence for models H₁ and H₂; then we can generate mock data from sources simulating H₁ and H₂ and evaluate the evidence; if there is any systematic bias, averaged over several mock data sets, for the estimated evidence to favor the false model, then we can be sure that our numerical implementation is not evaluating the evidence correctly.

¹⁴Skilling's result presumably occurred because the particular parameter values of the true model that generated the data were not typical of the prior used when evaluating the evidence for that model. In such a case, the log evidence difference can show a transient bias against the true model, for small quantities of data; such biases are usually corrected by greater quantities of data.
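The inequality invoked here can be spelled out (my own elaboration of the cited result). Writing p = P(D | H₁) and q = P(D | H₂), Jensen's inequality applied to the concave function log gives

∫ p log(p/q) dD = −∫ p log(q/p) dD ≥ −log ∫ p (q/p) dD = −log ∫ q dD = 0

with equality if and only if q = p; so the expected log evidence difference is nonnegative, and is zero only when the two models make identical predictions.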
This issue is illustrated using data set Y. The "truth" is that this data set was actually generated from a quadratic Hermite function, 1.1(1 − x + 2x²)e^(−x²/2). By the above argument the evidence ought probably to favor the model "the interpolant is a 3-coefficient Hermite function" over our other models. Table 1 shows the evidence for the true Hermite function model, and for other models. As already stated, the truth is indeed considerably more probable than the alternatives.

Having demonstrated that Bayes cannot systematically fail when one of the models is true, we now examine the way in which this framework can fail, if none of the models offered to Bayes is any good.
6.6 Comparison with "Generalization Error." It is a popular and intuitive criterion for choosing between alternative interpolants (found using different models) to compare their errors on a test set that was not used to derive the interpolants. "Cross-validation" is a more refined and more computationally expensive version of this same idea. How does this method relate to the evaluation of the evidence described in this paper?

Figure 7c displayed the evidence for the family of spline interpolants. Figure 7d shows the corresponding test error, measured on a test set with size over twice as big (90) as the "training" data set (37) used to determine the interpolant. A similar comparison was made in Figure 5b. Note that the overall trends shown by the evidence are matched by trends in the test error (if you flip one graph upside down). Also, for this particular problem, the ranks of the alternative spline models under the evidence are similar to their ranks under the test error. And in Figure 5b, the evidence maximum over α is surrounded by the test error minima. Thus this suggests that the evidence might be a reliable predictor of generalization ability. However, this is not necessarily the case. There are five reasons why the evidence and the test error might not be correlated.

First, the test error is a noisy quantity. It is necessary to devote large quantities of data to the test set to obtain a reasonable signal-to-noise ratio. In Figure 5b more than twice as much data is in each test set, but the difference in log α between the two test error minima exceeds the size of the Bayesian confidence interval for log α.

Second, the model with greatest evidence is not expected to be the best model all the time: Bayesian inferences are uncertain. The whole point of Bayes is that it quantifies precisely those uncertainties: the relative values of the evidence for alternative models express the plausibility of the models, given the data and the underlying assumptions.

Third, there is more to the evidence than there is to the generalization error. For example, imagine that for two models, the most probable interpolants happen to be identical. In this case, the two solutions will have the same generalization error, but the evidence will not in general
be the same: typically, the model that was a priori more complex will suffer a larger Occam factor and will have a smaller evidence.

Fourth, the test error is a measure of performance only of the single most probable interpolant: the evidence is a measure of plausibility of the entire posterior ensemble around the best fit interpolant. Probably a stronger correlation between the evidence and the test statistic would be obtained if the test statistic used were the average of the test error over the posterior ensemble of solutions. This ensemble test error is not so easy to compute.

The fifth and most interesting reason why the evidence might not be correlated with the generalization error is that there might be a flaw in the underlying assumptions such that the models being compared might all be poor models. If a poor regularizer is used, for example, one that is ill-matched to the statistics of the world, then the Bayesian choice of α will often not be the best in terms of generalization error (Davies and Anderssen 1986; Gull 1989a; Haussler et al. 1991). Such a failure occurs in the companion paper on neural networks. What is our attitude to such a failure of Bayesian prediction?

The failure of the evidence does not mean that we should discard Bayes' rule and use the generalization error as our criterion for choosing α. A failure is an opportunity to learn; a healthy scientist actively searches for such failures, because they yield insights into the defects of the current model. The detection of such a failure (by evaluating the generalization error, for example) motivates the search for new models that do not fail in this way; for example, alternative regularizers can be tried until a model is found that makes the data more probable. If one uses the generalization error only as a criterion for model comparison, one is denied this mechanism for learning. The development of maximum entropy image deconvolution was held up for years because no one used the Bayesian choice of α; once the Bayesian choice of α was used (Gull 1989a), the results obtained were most dissatisfactory, making clear what a poor regularizer was being used; this motivated an immediate search for alternative priors; the new, more probable priors discovered by this search are now at the heart of the state of the art in image deconvolution (Weir 1991).

6.7 The Similarity between Regularization and "Early Stopping." While an overparameterized model is fitted to a data set using gradient descent on the data error, it is sometimes noted that the model's generalization error passes through a minimum, rather than decreasing monotonically. This is known as "overlearning" in the neural networks community, and some researchers advocate the use of "early stopping," that is, stopping gradient descent before the data error minimum is reached, so as to try to obtain solutions with smaller generalization error.

This author believes that "overlearning" should be viewed as a symptom of a model ill-matched to the data set, and that the appropriate response is not to patch up a bad model, but rather to search for models that are well matched to our data. In particular, the use of models incorporating simple regularizers is expected to give results qualitatively similar to early stopping. This can be seen by examining Figure 6. The regularizer moves the minimum of the objective function from w_ML to w_MP; as the strength α of the regularizer is increased, w_MP follows a knee-shaped trajectory from w_ML to the origin; a typical solution w_MP is shown in Figure 6. If on the other hand gradient descent on the likelihood (data error) is used, and if the typical initial condition is close to the origin, then gradient descent will follow a similar knee-shaped trajectory. Thus qualitatively similar solutions are expected from increasingly early stopping and increasingly strong regularization with complete minimization. Regularization is to be preferred as a more robust, repeatable, and comprehensible procedure.
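The claimed similarity of the two trajectories can be seen in a toy quadratic problem. In the sketch below (entirely illustrative; matrices and step size are arbitrary), gradient descent on the data misfit starting at the origin and the regularized solution w_MP(α) for decreasing α both fit the well-determined eigendirection first and the poorly determined one later, tracing the same knee shape:

```python
import numpy as np

B = np.array([[4.0, 0.0], [0.0, 0.25]])   # data-misfit Hessian, ill-conditioned
w_ml = np.array([1.0, 1.0])               # maximum likelihood solution

# Gradient descent on E_D = 0.5 (w - w_ml)^T B (w - w_ml), starting at the origin
eta, w = 0.1, np.zeros(2)
descent_path = [w.copy()]
for _ in range(60):
    w = w - eta * B @ (w - w_ml)
    descent_path.append(w.copy())

# Regularized solutions w_MP(alpha) = (B + alpha I)^{-1} B w_ml, from large to small alpha
reg_path = [np.linalg.solve(B + a*np.eye(2), B @ w_ml)
            for a in np.exp(np.linspace(4, -4, 61))]

print(np.array(descent_path)[::10])       # both paths bend around the same knee
print(np.array(reg_path)[::10])
```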
6.8 Admitting Neural Networks into the Canon of Bayesian Interpolation Models. A second paper will discuss how to apply this Bayesian framework to feedforward neural networks. Preliminary results using these methods are included in Table 1. Assuming that the approximations used were valid, it is interesting that the evidence for neural nets is actually good for both the spiky and the smooth data sets. Furthermore, neural nets, in spite of their arbitrariness, yield a relatively compact model, with fewer parameters needed than to specify the splines and radial basis function solutions.
7 Conclusions
The recently developed methods of Bayesian model comparison and regularization have been presented. Models can be ranked by evaluating the evidence, a solely data-dependent measure that intuitively and consistently combines a model's ability to fit the data with its complexity. The precise posterior probabilities of the models also depend on the subjective priors that we assign to them, but these terms are typically overwhelmed by the evidence. Regularizing constants are set by maximizing the evidence. For many regularization problems, the theory of the number of well-measured parameters makes it possible to perform this optimization on-line.

In the interpolation examples discussed, the evidence was used to set the number of basis functions k in a polynomial model; to set the characteristic size r in a radial basis function model; to choose the order p of the regularizer for a spline model; and to rank all these different models in the light of the data.

Further work is needed to formalize the relationship of this framework to the pragmatic model comparison technique of cross-validation. Using the two techniques in parallel, it is possible to detect flaws in the
underlying assumptions implicit in the data models being used. Such failures direct our search for superior models, providing a powerful tool for human learning. There are thousands of data modeling tasks waiting for the evidence to be evaluated. It will be exciting to see how much we can learn when this is done.
Acknowledgments

I thank Mike Lewicki, Nick Weir, and David R. T. Robinson for helpful conversations, and Andreas Herz for comments on the manuscript. I am grateful to Dr. R. Goodman and Dr. P. Smyth for funding my trip to Maxent 90. This work was supported by a Caltech Fellowship and a Studentship from SERC, UK.

References

Akaike, H. 1970. Statistical predictor identification. Ann. Inst. Statist. Math. 22, 203-217.
Berger, J. 1985. Statistical Decision Theory and Bayesian Analysis. Springer, Berlin.
Box, G. E. P., and Tiao, G. C. 1973. Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, MA.
Bretthorst, G. L. 1990. Bayesian analysis. I. Parameter estimation using quadrature NMR models. II. Signal detection and model selection. III. Applications to NMR. J. Mag. Reson. 88(3), 533-595.
Charter, M. K. 1991. Quantifying drug absorption. In Maximum Entropy and Bayesian Methods, Laramie, 1990, W. T. Grandy and L. H. Schick, eds., pp. 245-252. Kluwer, Dordrecht.
Cox, R. T. 1946. Probability, frequency, and reasonable expectation. Am. J. Phys. 14, 1-13.
Davies, A. R., and Anderssen, R. S. 1986. Optimization in the regularization of ill-posed problems. J. Aust. Math. Soc. Ser. B 28, 114-133.
Eubank, R. L. 1988. Spline Smoothing and Non-parametric Regression. Marcel Dekker, New York.
Gull, S. F. 1988. Bayesian inductive inference and maximum entropy. In Maximum Entropy and Bayesian Methods in Science and Engineering, Vol. 1: Foundations, G. J. Erickson and C. R. Smith, eds., pp. 53-74. Kluwer, Dordrecht.
Gull, S. F. 1989a. Developments in maximum entropy data analysis. In Maximum Entropy and Bayesian Methods, Cambridge, 1988, J. Skilling, ed., pp. 53-71. Kluwer, Dordrecht.
Gull, S. F. 1989b. Bayesian data analysis: Straight-line fitting. In Maximum Entropy and Bayesian Methods, Cambridge, 1988, J. Skilling, ed., pp. 511-518. Kluwer, Dordrecht.
Gull, S. F., and Skilling, J. 1991. Quantified Maximum Entropy. MemSys5 User's Manual. M.E.D.C., 33 North End, Royston, SG8 6NR, England.
Hanson, R., Stutz, J., and Cheeseman, P. 1991. Bayesian classification theory. NASA Ames TR FIA-90-12-7-01.
Haussler, D., Kearns, M., and Schapire, R. 1991. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. In Proceedings of the Fourth COLT Workshop. Morgan Kaufmann, San Mateo, CA.
Jaynes, E. T. 1986. Bayesian methods: General background. In Maximum Entropy and Bayesian Methods in Applied Statistics, J. H. Justice, ed., pp. 1-25. Cambridge University Press, Cambridge.
Jeffreys, H. 1939. Theory of Probability. Oxford University Press, Oxford.
Kashyap, R. L. 1977. A Bayesian comparison of different classes of dynamic models using empirical data. IEEE Transact. Automatic Control AC-22(5), 715-727.
Loredo, T. J. 1989. From Laplace to supernova SN 1987A: Bayesian inference in astrophysics. In Maximum Entropy and Bayesian Methods, P. Fougere, ed., pp. 81-142. Kluwer, Dordrecht.
MacKay, D. J. C. 1992a. A practical Bayesian framework for backpropagation networks. Neural Comp. 4(3), 448-472.
MacKay, D. J. C. 1992b. Information-based objective functions for active data selection. Neural Comp., to appear.
Neal, R. M. 1991. Bayesian mixture modeling by Monte Carlo simulation. Preprint, Dept. of Computer Science, University of Toronto.
Osteyee, D. B., and Good, I. J. 1974. Information, Weight of Evidence, the Singularity between Probability Measures and Signal Detection. Springer, Berlin.
Patrick, J. D., and Wallace, C. S. 1982. Stone circle geometries: An information theory approach. In Archaeoastronomy in the Old World, D. C. Heggie, ed. Cambridge University Press, Cambridge.
Poggio, T., Torre, V., and Koch, C. 1985. Computational vision and regularization theory. Nature (London) 317(6035), 314-319.
Rissanen, J. 1978. Modeling by shortest data description. Automatica 14, 465-471.
Schwarz, G. 1978. Estimating the dimension of a model. Ann. Stat. 6(2), 461-464.
Sibisi, S. 1991. Bayesian interpolation. In Maximum Entropy and Bayesian Methods, Laramie, 1990, W. T. Grandy, Jr., and L. H. Schick, eds., pp. 349-355. Kluwer, Dordrecht.
Skilling, J. 1989. The eigenvalues of mega-dimensional matrices. In Maximum Entropy and Bayesian Methods, Cambridge, 1988, J. Skilling, ed., pp. 455-466. Kluwer, Dordrecht.
Skilling, J. 1991. On parameter estimation and quantified MaxEnt. In Maximum Entropy and Bayesian Methods, Laramie, 1990, W. T. Grandy, Jr., and L. H. Schick, eds., pp. 267-273. Kluwer, Dordrecht.
Skilling, J., Robinson, D. R. T., and Gull, S. F. 1991. Probabilistic displays. In Maximum Entropy and Bayesian Methods, Laramie, 1990, W. T. Grandy, Jr., and L. H. Schick, eds., pp. 365-368. Kluwer, Dordrecht.
Stigler, S. M. 1986. Laplace's 1774 memoir on inverse probability. Stat. Sci. 1(3), 359-378.
Szeliski, R. 1989. Bayesian Modeling of Uncertainty in Low Level Vision. Kluwer, Dordrecht.
Titterington, D. 1985. Common structure of smoothing techniques in statistics. Int. Statist. Rev. 53, 141-170.
Walker, A. M. 1967. On the asymptotic behaviour of posterior distributions. J. R. Stat. Soc. B 31, 80-88.
Wallace, C. S., and Boulton, D. M. 1968. An information measure for classification. Comput. J. 11(2), 185-194.
Wallace, C. S., and Freeman, P. R. 1987. Estimation and inference by compact coding. J. R. Statist. Soc. B 49(3), 240-265.
Weir, N. 1991. Applications of maximum entropy techniques to HST data. In Proc. ESO/ST-ECF Data Analysis Workshop, April 1991.
Zellner, A. 1984. Basic Issues in Econometrics. University of Chicago Press, Chicago.
Received 21 May 1991; accepted 29 October 1991.
108. Leyden Fernández, Julio Caballero, José Ignacio Abreu, Michael Fernández. 2007. Amino acid sequence autocorrelation vectors and bayesian-regularized genetic neural networks for modeling protein conformational stability: Gene V protein mutants. Proteins: Structure, Function, and Bioinformatics 67:4, 834-852. [CrossRef] 109. Kamban Parasuraman, Amin Elshorbagy, Sean Carey. 2007. Modelling the dynamics of the evapotranspiration process using genetic programming / Modelisation de la dynamique du processus evapotranspiratoire par programmation genetique. Hydrological Sciences Journal 52:3, 563-578. [CrossRef] 110. Simon Dye, Steve Warren. 2007. Constraints on Dark and Visible Mass in Galaxies from Strong Gravitational Lensing. Proceedings of the International Astronomical Union 3:S244. . [CrossRef] 111. Jin Cheng, C. S. Cai, Ru-Cheng Xiao. 2007. Estimation of cable safety factors of suspension bridges using artificial neural network-based inverse reliability method. International Journal for Numerical Methods in Engineering 70:9, 1112-1133. [CrossRef] 112. K. Guney, C. Yildiz, S. Kaya, M. Turkmen. 2007. NEURAL MODELS FOR THE BROADSIDE-COUPLED V-SHAPED MICROSHIELD COPLANAR WAVEGUIDES. International Journal of Infrared and Millimeter Waves 27:9, 1241-1255. [CrossRef] 113. S Koyama, T Shimokawa, S Shinomoto. 2007. Phase transitions in the estimation of event rate: a path integral analysis. Journal of Physics A: Mathematical and Theoretical 40:20, F383-F390. [CrossRef] 114. Ueli Meier, Andrew Curtis, Jeannot Trampert. 2007. Global crustal thickness from neural network inversion of surface wave data. Geophysical Journal International 169:2, 706-722. [CrossRef] 115. K. Guney, S. S. Gultekin. 2007. A comparative study of neural networks for input resistance computation of electrically thin and thick rectangular microstrip antennas. Journal of Communications Technology and Electronics 52:5, 483-492. [CrossRef] 116. Shinichi Nakajima, Sumio Watanabe. 2007. Variational Bayes Solution of Linear Neural Networks and Its Generalization PerformanceVariational Bayes Solution of Linear Neural Networks and Its Generalization Performance. Neural Computation 19:4, 1112-1153. [Abstract] [PDF] [PDF Plus] 117. Michael Fernández, Julio Caballero. 2007. QSAR models for predicting the activity of non-peptide luteinizing hormone-releasing hormone (LHRH) antagonists derived from erythromycin A using quantum chemical properties. Journal of Molecular Modeling 13:4, 465-476. [CrossRef] 118. Kerim Guney, Nurcan Sarikaya. 2007. A Hybrid Method Based on Combining Artificial Neural Network and Fuzzy Inference System for Simultaneous Computation of Resonant Frequencies of Rectangular, Circular, and Triangular
Microstrip Antennas. IEEE Transactions on Antennas and Propagation 55:3, 659-668. [CrossRef] 119. Ting Chen, Liqing Li, Ludovic Koehl, Philippe Vroman, Xianyi Zeng. 2007. A soft computing approach to model the structure–property relations of nonwoven fabrics. Journal of Applied Polymer Science 103:1, 442-450. [CrossRef] 120. Randa Herzallah, David Lowe. 2007. Distribution Modeling of Nonlinear Inverse Controllers Under a Bayesian Framework. IEEE Transactions on Neural Networks 18:1, 107-114. [CrossRef] 121. Ming Ye, Raziuddin Khaleel, Marcel G. Schaap, Jianting Zhu. 2007. Simulation of field injection experiments in heterogeneous unsaturated media using cokriging and artificial neural network. Water Resources Research 43:7. . [CrossRef] 122. R Potenza, J F Dunne, S Vulli, D Richardson, P King. 2007. Multicylinder engine pressure reconstruction using NARX neural networks and crank kinematics. International Journal of Engine Research 8:6, 499-518. [CrossRef] 123. Julio Caballero, Alain Tundidor-Camba, Michael Fernández. 2007. Modeling of the Inhibition Constant (Ki) of Some Cruzain Ketone-Based Inhibitors Using 2D Spatial Autocorrelation Vectors and Data-Diverse Ensembles of Bayesian-Regularized Genetic Neural Networks. QSAR & Combinatorial Science 26:1, 27-40. [CrossRef] 124. Dmitriy Shutin, Gernot Kubin, Bernard H. Fleury. 2007. Application of the Evidence Procedure to the Estimation of Wireless Channels. EURASIP Journal on Advances in Signal Processing 2007, 1-24. [CrossRef] 125. Zongzhao Zhou, Yew Soon Ong, Prasanth B. Nair, Andy J. Keane, Kai Yew Lum. 2007. Combining Global and Local Surrogate Models to Accelerate Evolutionary Optimization. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 37:1, 66-76. [CrossRef] 126. Xiaoyuan Zhu, Cuntai Guan, Jiankang Wu, Yimin Cheng, Yixiao Wang. 2007. Expectation-Maximization Method for EEG-Based Continuous Cursor Control. EURASIP Journal on Advances in Signal Processing 2007, 1-11. [CrossRef] 127. Xia Hong, Sheng Chen, Chris J. Harris. 2007. A Kernel-Based Two-Class Classifier for Imbalanced Data Sets. IEEE Transactions on Neural Networks 18:1, 28-41. [CrossRef] 128. Ramazan Gencay, Rajna Gibson. 2007. Model Risk for European-Style Stock Index Options. IEEE Transactions on Neural Networks 18:1, 193-202. [CrossRef] 129. S. S. Leroy, J. G. Anderson. 2007. Estimating Eliassen-Palm flux using COSMIC radio occultation. Geophysical Research Letters 34:10. . [CrossRef] 130. Jean-Philippe Boulanger, Fernando Martinez, Olga Penalba, Enrique Carlos Segura. 2006. Neural network based daily precipitation generator (NNGEN-P). Climate Dynamics 28:2-3, 307-324. [CrossRef]
131. Trevor C. Bailey, Richard M. Everson, Jonathan E. Fieldsend, Wojtek J. Krzanowski, Derek Partridge, Vitaly Schetinin. 2006. Representing classifier confidence in the safety critical domain: an illustration from mortality prediction in trauma cases. Neural Computing and Applications 16:1, 1-10. [CrossRef] 132. Michael Fernández, Julio Caballero. 2006. Ensembles of Bayesian-regularized Genetic Neural Networks for Modeling of Acetylcholinesterase Inhibition by Huprines. Chemical Biology & Drug Design 68:4, 201-212. [CrossRef] 133. S. H. Suyu, P. J. Marshall, M. P. Hobson, R. D. Blandford. 2006. A Bayesian analysis of regularized source inversions in gravitational lensing. Monthly Notices of the Royal Astronomical Society 371:2, 983-998. [CrossRef] 134. J. Li, M.T. Manry, P.L. Narasimha, C. Yu. 2006. Feature Selection Using a Piecewise Linear Network. IEEE Transactions on Neural Networks 17:5, 1101-1115. [CrossRef] 135. Jean-Philippe Boulanger, Fernando Martinez, Enrique C. Segura. 2006. Projection of future climate change conditions using IPCC simulations, neural networks and Bayesian statistics. Part 1: Temperature mean state and seasonal cycle in South America. Climate Dynamics 27:2-3, 233-259. [CrossRef] 136. Malcolm R. Haylock, Gavin C. Cawley, Colin Harpham, Rob L. Wilby, Clare M. Goodess. 2006. Downscaling heavy precipitation over the United Kingdom: a comparison of dynamical and statistical methods and their future scenarios. International Journal of Climatology 26:10, 1397-1415. [CrossRef] 137. Farzan Aminian, E. Dante Suarez, Mehran Aminian, Daniel T. Walz. 2006. Forecasting Economic Data with Neural Networks. Computational Economics 28:1, 71-88. [CrossRef] 138. D. Wedge, D. Ingram, D. Mclean, C. Mingham, Z. Bandar. 2006. On Global–Local Artificial Neural Networks for Function Approximation. IEEE Transactions on Neural Networks 17:4, 942-952. [CrossRef] 139. A. Bermak, S.B. Belhouari. 2006. Bayesian Learning Using Gaussian Process for Gas Identification. IEEE Transactions on Instrumentation and Measurement 55:3, 787-792. [CrossRef] 140. R. Begg, J. Kamruzzaman. 2006. Neural networks for detection and classification of walking pattern changes due to ageing. Australasian Physics & Engineering Sciences in Medicine 29:2, 188-195. [CrossRef] 141. Min Xu, Guangming Zeng, Xinyi Xu, Guohe Huang, Ru Jiang, Wei Sun. 2006. Application of Bayesian Regularized BP Neural Network Model for Trend Analysis, Acidity and Chemical Composition of Precipitation in North Carolina. Water, Air, and Soil Pollution 172:1-4, 167-184. [CrossRef] 142. Heung-Fai Lam, Ka-Veng Yuen, James L. Beck. 2006. Structural Health Monitoring via Measured Ritz Vectors Utilizing Artificial Neural Networks. Computer-Aided Civil and Infrastructure Engineering 21:4, 232-241. [CrossRef]
143. S. Kar, T. Searles, E. Lee, G. B. Viswanathan, H. L. Fraser, J. Tiley, R. Banerjee. 2006. Modeling the tensile properties in β-processed α/β Ti alloys. Metallurgical and Materials Transactions A 37:3, 559-566. [CrossRef] 144. P. Lauret, E. Fock, T.A. Mara. 2006. A Node Pruning Algorithm Based on a Fourier Amplitude Sensitivity Test Method. IEEE Transactions on Neural Networks 17:2, 273-293. [CrossRef] 145. G.C. Cawley, N.L.C. Talbot, G.J. Janacek, M.W. Peck. 2006. Sparse Bayesian Kernel Survival Analysis for Modeling the Growth Domain of Microbial Pathogens. IEEE Transactions on Neural Networks 17:2, 471-481. [CrossRef] 146. Xiang-jun Wen, Yu-nong Zhang, Wei-wu Yan, Xiao-ming Xu. 2006. Nonlinear decoupling controller design based on least squares support vector regression. Journal of Zhejiang University SCIENCE A 7:2, 275-284. [CrossRef] 147. E. Keehan, L. Karlsson, H.-O. Andrén, H. K. D. H. Bhadeshia. 2006. Influence of carbon, manganese and nickel on microstructure and properties of strong steel weld metals: Part 3 – Increased strength resulting from carbon additions. Science and Technology of Welding & Joining 11:1, 19-24. [CrossRef] 148. Jack C. Schryver, Craig C. Brandt, Susan M. Pfiffner, Anthony V. Palumbo, Aaron D. Peacock, David C. White, James P. McKinley, Philip E. Long. 2006. Application of Nonlinear Analysis Methods for Identifying Relationships Between Microbial Community Structure and Groundwater Geochemistry. Microbial Ecology 51:2, 177-188. [CrossRef] 149. Matthew C. Coleman, David E. Block. 2006. Bayesian parameter estimation with informative priors for nonlinear systems. AIChE Journal 52:2, 651-667. [CrossRef] 150. Jianting Guo, Jieshan Hou, Lanzhang Zhou, Hengqiang Ye. 2006. Prediction and Improvement of Mechanical Properties of Corrosion Resistant Superalloy K44 with Adjusting Minor Additions C, B and Hf. MATERIALS TRANSACTIONS 47:1, 198-206. [CrossRef] 151. Marcelo C. Medeiros, Timo Teräsvirta, Gianluigi Rech. 2006. Building neural network models for time series: a statistical approach. Journal of Forecasting 25:1, 49-75. [CrossRef] 152. O. Seidou, T. B. M. J. Ouarda, L. Bilodeau, M. Hessami, A. St-Hilaire, P. Bruneau. 2006. Modeling ice growth on Canadian lakes using artificial neural networks. Water Resources Research 42:11. . [CrossRef] 153. Julio Caballero, Michael Fernández. 2006. Linear and nonlinear modeling of antifungal activity of some heterocyclic ring derivatives using multiple linear regression and Bayesian-regularized neural networks. Journal of Molecular Modeling 12:2, 168-181. [CrossRef] 154. Mohammad Sajjad Khan, Paulin Coulibaly. 2006. Bayesian neural network for rainfall-runoff modeling. Water Resources Research 42:7. . [CrossRef]
155. Jian-hua Xu, Xue-gong Zhang, Yan-da Li. 2006. Regularized Kernel Forms of Minimum Squared Error Method. Frontiers of Electrical and Electronic Engineering in China 1:1, 1-7. [CrossRef] 156. Jean-Luc Schwartz. 2006. The 0�0 problem in the fuzzy-logical model of perception. The Journal of the Acoustical Society of America 120:4, 1795. [CrossRef] 157. Philippe Lauret, Mathieu David, Eric Fock, Alain Bastide, Carine Riviere. 2006. Bayesian and Sensitivity Analysis Approaches to Modeling the Direct Solar Irradiance. Journal of Solar Energy Engineering 128:3, 394. [CrossRef] 158. Joarder Kamruzzaman, Rezaul K. Begg. 2006. Support Vector Machines and Other Pattern Recognition Approaches to the Diagnosis of Cerebral Palsy Gait. IEEE Transactions on Biomedical Engineering 53:12, 2479-2490. [CrossRef] 159. H. S. Wang, J. R. Yang, H. K. D. H. Bhadeshia. 2005. Characterisation of severely deformed austenitic stainless steel wire. Materials Science and Technology 21:11, 1323-1328. [CrossRef] 160. Julio Caballero, Miguel Garriga, Michael Fernández. 2005. Genetic neural network modeling of the selective inhibition of the intermediate-conductance Ca2+-activated K+ channel by some triarylmethanes using topological charge indexes descriptors. Journal of Computer-Aided Molecular Design 19:11, 771-789. [CrossRef] 161. Ilya Nemenman . 2005. Fluctuation-Dissipation Theorem and Models of LearningFluctuation-Dissipation Theorem and Models of Learning. Neural Computation 17:9, 2006-2033. [Abstract] [PDF] [PDF Plus] 162. Andreas Lindemann, Christian L. Dunis, Paulo Lisboa. 2005. Level estimation, classification and probability distribution architectures for trading the EUR/USD exchange rate. Neural Computing and Applications 14:3, 256-271. [CrossRef] 163. T. N. Singh, A. K. Verma. 2005. Prediction of creep characteristic of rock under varying environment. Environmental Geology 48:4-5, 559-568. [CrossRef] 164. Shinsuke Koyama, Shigeru Shinomoto. 2005. Empirical Bayes interpretations of random point events. Journal of Physics A: Mathematical and General 38:29, L531-L537. [CrossRef] 165. 2005. A Technique for Pattern Recognition of Concrete Surface Cracks. Journal of the Korea Concrete Institute 17:3, 369-374. [CrossRef] 166. K. Yamazaki, S. Watanabe. 2005. Singularities in Complete Bipartite Graph-Type Boltzmann Machines and Upper Bounds of Stochastic Complexities. IEEE Transactions on Neural Networks 16:2, 312-324. [CrossRef] 167. Yong-Hua Wang, Yan Li, Sheng-Li Yang, Ling Yang. 2005. An in silico approach for screening flavonoids as P-glycoprotein inhibitors based on a Bayesian-regularized neural network. Journal of Computer-Aided Molecular Design 19:3, 137-147. [CrossRef]
168. Celal Yildiz, Mustafa Turkmen. 2005. Very accurate and simple CAD models based on neural networks for coplanar waveguide synthesis. International Journal of RF and Microwave Computer-Aided Engineering 15:2, 218-224. [CrossRef] 169. L. Zhang, P.B. Luh. 2005. Neural Network-Based Market Clearing Price Prediction and Confidence Interval Estimation With an Improved Extended Kalman Filter Method. IEEE Transactions on Power Systems 20:1, 59-66. [CrossRef] 170. Konstantinos Blekas, Dimitrios I. Fotiadis, Aristidis Likas. 2005. Motif-Based Protein Sequence Classification Using Neural Networks. Journal of Computational Biology 12:1, 64-82. [CrossRef] 171. T. N. Singh, R. Kanchan, A. K. Verma, K. Saigal. 2005. A comparative study of ANN and Neuro-fuzzy for the prediction of dynamic constant of rockmass. Journal of Earth System Science 114:1, 75-86. [CrossRef] 172. Anthony T. C. Goh, Fred H. Kulhawy, C. G. Chua. 2005. Bayesian Neural Network Analysis of Undrained Side Resistance of Drilled Shafts. Journal of Geotechnical and Geoenvironmental Engineering 131:1, 84. [CrossRef] 173. Abedalrazq Khalil, Mac McKee, Mariush Kemblowski, Tirusew Asefa. 2005. Sparse Bayesian learning machine for real-time management of reservoir releases. Water Resources Research 41:11. . [CrossRef] 174. G.G. Yen, L.-W. Ho. 2005. Online Fault Accomodation Control of Catastrophic System Failures. Control and Intelligent Systems 33:2. . [CrossRef] 175. M.C. Medeiros, A. Veiga. 2005. A Flexible Coefficient Smooth Transition Time Series Model. IEEE Transactions on Neural Networks 16:1, 97-113. [CrossRef] 176. François Anctil, Alexandre Rat. 2005. Evaluation of Neural Network Streamflow Forecasting on 47 Watersheds. Journal of Hydrologic Engineering 10:1, 85. [CrossRef] 177. C. Xiang, S. Ding, T.H. Lee. 2005. Geometrical Interpretation and Architecture Selection of MLP. IEEE Transactions on Neural Networks 16:1, 84-96. [CrossRef] 178. C. CAPDEVILA, F. G. CABALLERO, C. GARCÍA DE ANDRÉS. 2005. Neural Network Model for Isothermal Pearlite Transformation. Part II: Growth Rate. ISIJ International 45:2, 238-247. [CrossRef] 179. C. CAPDEVILA, F. G. CABALLERO, C. GARCÍA DE ANDRÉS. 2005. Neural Network Model for Isothermal Pearlite Transformation. Part I: Interlamellar Spacing. ISIJ International 45:2, 229-237. [CrossRef] 180. D. J. G. James, F. Boehringer, K. J. Burnham, D. G. Copp. 2004. Adaptive driver model using a neural network. Artificial Life and Robotics 7:4, 170-176. [CrossRef] 181. Shubhabrata Datta, Malay K. Banerjee. 2004. Optimizing parameters of supervised learning techniques (ANN) for precise mapping of the input-output
relationship in TMCP steels. Scandinavian Journal of Metallurgy 33:6, 310-315. [CrossRef] 182. Andreas Lindemann, Christian L. Dunis, Paulo Lisboa. 2004. Probability distributions, trading strategies and leverage: an application of Gaussian mixture models. Journal of Forecasting 23:8, 559-585. [CrossRef] 183. B. Apolloni, A. Esposito, D. Malchiodi, C. Orovas, G. Palmas, J.G. Taylor. 2004. A General Framework for Learning Rules From Data. IEEE Transactions on Neural Networks 15:6, 1333-1349. [CrossRef] 184. C.K. Loo, M. Rajeswari, M.V.C. Rao. 2004. Novel Direct and Self-Regulating Approaches to Determine Optimum Growing Multi-Experts Network Structure. IEEE Transactions on Neural Networks 15:6, 1378-1395. [CrossRef] 185. I Christov, G Bortolan. 2004. Ranking of pattern recognition parameters for premature ventricular contractions classification by neural networks. Physiological Measurement 25:5, 1281-1290. [CrossRef] 186. Kazuyuki Tanaka, Hayaru Shouno, Masato Okada, D M Titterington. 2004. Accuracy of the Bethe approximation for hyperparameter estimation in probabilistic image processing. Journal of Physics A: Mathematical and General 37:36, 8675-8695. [CrossRef] 187. Patrik Finne, Ralf Finne, Chris Bangma, Jonas Hugosson, Matti Hakama, Anssi Auvinen, Ulf-H�kan Stenman. 2004. Algorithms based on prostate-specific antigen (PSA), free PSA, digital rectal examination and prostate volume reduce false-postitive PSA results in prostate cancer screening. International Journal of Cancer 111:2, 310-315. [CrossRef] 188. D.P. Wipf, B.D. Rao. 2004. Sparse Bayesian Learning for Basis Selection. IEEE Transactions on Signal Processing 52:8, 2153-2164. [CrossRef] 189. G. Deng. 2004. Iterative Learning Algorithms for Linear Gaussian Observation Models. IEEE Transactions on Signal Processing 52:8, 2286-2297. [CrossRef] 190. S. Chen, X. Hong, C.J. Harris. 2004. Sparse Kernel Density Construction Using Orthogonal Forward Regression With Leave-One-Out Test Score and Local Regularization. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 34:4, 1708-1717. [CrossRef] 191. A. Honkela, H. Valpola. 2004. Variational Learning and Bits-Back Coding: An Information-Theoretic View to Bayesian Learning. IEEE Transactions on Neural Networks 15:4, 800-810. [CrossRef] 192. M. Pardo, G. Sberveglieri. 2004. Remarks on the Use of Multilayer Perceptrons for the Analysis of Chemical Sensor Array Data. IEEE Sensors Journal 4:3, 355-363. [CrossRef] 193. A. Ilin, H. Valpola, E. Oja. 2004. Nonlinear Dynamical Factor Analysis for State Change Detection. IEEE Transactions on Neural Networks 15:3, 559-575. [CrossRef]
194. Yuji Mitsuhata. 2004. Adjustment of regularization in ill-posed linear inverse problems by the empirical Bayes approach. Geophysical Prospecting 52:3, 213-239. [CrossRef] 195. S. Chen, X. Hong, C.J. Harris, P.M. Sharkey. 2004. Sparse Modeling Using Orthogonal Forward Regression With PRESS Statistic and Regularization. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 34:2, 898-911. [CrossRef] 196. T. D. Saini, J. Weller, S. L. Bridle. 2004. Revealing the nature of dark energy using Bayesian evidence. Monthly Notices of the Royal Astronomical Society 348:2, 603-608. [CrossRef] 197. Alberto Malinverno, Victoria A. Briggs. 2004. Expanded uncertainty quantification in inverse problems: Hierarchical Bayes and empirical Bayes. Geophysics 69:4, 1005. [CrossRef] 198. James L. Beck, Ka-Veng Yuen. 2004. Model Selection Using Response Measurements: Bayesian Probabilistic Approach. Journal of Engineering Mechanics 130:2, 192. [CrossRef] 199. David J. Battle, Peter Gerstoft, William S. Hodgkiss, W. A. Kuperman, Peter L. Nielsen. 2004. Bayesian model selection applied to self-noise geoacoustic inversion. The Journal of the Acoustical Society of America 116:4, 2043. [CrossRef] 200. Klaus Maisinger, M. P. Hobson, A. N. Lasenby. 2004. Maximum-entropy image reconstruction using wavelets. Monthly Notices of the Royal Astronomical Society 347:1, 339-354. [CrossRef] 201. C. Yildiz, S. Sagiroglu, M. Turkmen. 2004. Neural model for coplanar waveguide sandwiched between two dielectric substrates. IEE Proceedings - Microwaves, Antennas and Propagation 151:1, 7. [CrossRef] 202. H.S. Hippert, C.E. Pedreira. 2004. Estimating temperature profiles for short-term load forecasting: neural networks compared to linear models. IEE Proceedings - Generation, Transmission and Distribution 151:4, 543. [CrossRef] 203. Celal Yildiz, Oytun Saracoglu. 2003. Simple models based on neural networks for suspended and inverted microstrip lines. Microwave and Optical Technology Letters 39:5, 383-389. [CrossRef] 204. P. J. Marshall, M. P. Hobson, A. Slosar. 2003. Bayesian joint analysis of cluster weak lensing and Sunyaev-Zel'dovich effect data. Monthly Notices of the Royal Astronomical Society 346:2, 489-500. [CrossRef] 205. Celal Yildiz, Seref Sagiroglu, Oytun Sara�o?lu. 2003. Neural models for coplanar waveguides with a finite dielectric thickness. International Journal of RF and Microwave Computer-Aided Engineering 13:6, 438-446. [CrossRef] 206. François Anctil, Charles Perrin, Vazken Andréassian. 2003. ANN OUTPUT UPDATING OF LUMPED CONCEPTUAL RAINFALL/RUNOFF FORECASTING MODELS. Journal of the American Water Resources Association 39:5, 1269-1279. [CrossRef]
207. C. H. Wiggins, I. Nemenman. 2003. Process pathway inference via time series analysis. Experimental Mechanics 43:3, 361-370. [CrossRef] 208. Chee M. Ng. 2003. Comparison of Neural Network, Bayesian, and Multiple Stepwise Regression–Based Limited Sampling Models to Estimate Area Under the Curve. Pharmacotherapy 23:8, 1044-1051. [CrossRef] 209. C. G. Chua, A. T. C. Goh. 2003. A hybrid Bayesian back-propagation neural network approach to multivariate modelling. International Journal for Numerical and Analytical Methods in Geomechanics 27:8, 651-667. [CrossRef] 210. X. Hong, C.J. Harris, S. Chen, P.M. Sharkey. 2003. Robust nonlinear model identification methods using forward regression. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 33:4, 514-523. [CrossRef] 211. I. Rivals, L. Personnaz. 2003. Neural-network construction and selection in nonlinear modeling. IEEE Transactions on Neural Networks 14:4, 804-819. [CrossRef] 212. S. Chen, X. Hong, C.J. Harris. 2003. Sparse kernel regression modeling using combined locally regularized orthogonal least squares and d-optimality experimental design. IEEE Transactions on Automatic Control 48:6, 1029-1036. [CrossRef] 213. J.T. Kwok, I.W. Tsang. 2003. Linear dependency between Σ and the input noise in Σ-support vector regression. IEEE Transactions on Neural Networks 14:3, 544-553. [CrossRef] 214. Gary G. Yen, Liang-Wei Ho. 2003. Online multiple-model-based fault diagnosis and accommodation. IEEE Transactions on Industrial Electronics 50:2, 296-312. [CrossRef] 215. Ping Guo, M.R. Lyu, C.L.P. Chen. 2003. Regularization parameter estimation for feedforward neural networks. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 33:1, 35-44. [CrossRef] 216. Li Zhang, P.B. Luh, K. Kasiviswanathan. 2003. Energy clearing price prediction and confidence interval estimation with cascaded neural networks. IEEE Transactions on Power Systems 18:1, 99-105. [CrossRef] 217. Carlos Jiménez. 2003. Inversion of Odin limb sounding submillimeter observations by a neural network technique. Radio Science 38:4. . [CrossRef] 218. Thomas Schrøder. 2003. Validating the microwave sounding unit stratospheric record using GPS occultation. Geophysical Research Letters 30:14. . [CrossRef] 219. S. Chen, X. Hong, C.J. Harris. 2003. Sparse multioutput radial basis function network construction using combined locally regularised orthogonal least square and D-optimality experimental design. IEE Proceedings - Control Theory and Applications 150:2, 139. [CrossRef] 220. Maoyi Huang. 2003. A transferability study of model parameters for the variable infiltration capacity land surface scheme. Journal of Geophysical Research 108:D22. . [CrossRef]
221. M. M. Saggaf, M. Nafi Toksöz, H. M. Mustafa. 2003. Estimation of reservoir properties from seismic data by smooth neural networks. Geophysics 68:6, 1969. [CrossRef] 222. D. Chakraborty, N.R. Pal. 2003. A novel training scheme for multilayered perceptrons to realize proper generalization and incremental learning. IEEE Transactions on Neural Networks 14:1, 1-14. [CrossRef] 223. Young-Guk KIM, Chan-Kyoung PARK, Hee-Soo HWANG, Tae-Won PARK. 2003. Design Optimization for Suspension System of High Speed Train Using Neural Network. JSME International Journal Series C 46:2, 727-735. [CrossRef] 224. J Mohamad-Saleh, B S Hoyle. 2002. Determination of multi-component flow process parameters based on electrical capacitance tomography data using artificial neural networks. Measurement Science and Technology 13:12, 1815-1821. [CrossRef] 225. M.J. Cassidy, W.D. Penny. 2002. Bayesian nonstationary autoregressive models for biomedical signal analysis. IEEE Transactions on Biomedical Engineering 49:10, 1142-1152. [CrossRef] 226. David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin, Angelika van der Linde. 2002. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64:4, 583-639. [CrossRef] 227. Kazuyuki Tanaka. 2002. Statistical-mechanical approach to image processing. Journal of Physics A: Mathematical and General 35:37, R81-R150. [CrossRef] 228. Kwok-Wai Cheung, Dit-Yan Yeung, R.T. Chin. 2002. Bidirectional deformable matching with application to handwritten character extraction. IEEE Transactions on Pattern Analysis and Machine Intelligence 24:8, 1133-1139. [CrossRef] 229. M. Pardo, G. Sberveglieri. 2002. Learning from data: a tutorial with emphasis on modern pattern recognition methods. IEEE Sensors Journal 2:3, 203-217. [CrossRef] 230. Sung Bo Hwang, W.D. Kim, T.F. Edgar. 2002. Modeling of dry development in bilayer-resist process for 140-nm contact hole patterning. IEEE Transactions on Semiconductor Manufacturing 15:2, 245-252. [CrossRef] 231. Kazuyuki Tanaka, Tsuyoshi Horiguchi. 2002. Solvable Markov random field model in color image restoration. Physical Review E 65:4. . [CrossRef] 232. Thomas Loredo, Donald Lamb. 2002. Bayesian analysis of neutrinos observed from supernova SN 1987A. Physical Review D 65:6. . [CrossRef] 233. S. Chen. 2002. Multi-output regression using a locally regularised orthogonal least-squares algorithm. IEE Proceedings - Vision, Image, and Signal Processing 149:4, 185. [CrossRef] 234. K. Tsuda, M. Sugiyama, K.-R. Miller. 2002. Subspace information criterion for nonquadratic regularizers-Model selection for sparse regressors. IEEE Transactions on Neural Networks 13:1, 70-80. [CrossRef]
235. Ilya Nemenman, William Bialek. 2002. Occam factors and model independent Bayesian learning of continuous distributions. Physical Review E 65:2. . [CrossRef] 236. William Bialek , Ilya Nemenman , Naftali Tishby . 2001. Predictability, Complexity, and LearningPredictability, Complexity, and Learning. Neural Computation 13:11, 2409-2463. [Abstract] [PDF] [PDF Plus] 237. Masashi Sugiyama , Hidemitsu Ogawa . 2001. Subspace Information Criterion for Model SelectionSubspace Information Criterion for Model Selection. Neural Computation 13:8, 1863-1889. [Abstract] [PDF] [PDF Plus] 238. M.C. Medeiros, A. Veiga, C.E. Pedreira. 2001. Modeling exchange rates: smooth transitions, neural networks, and linear models. IEEE Transactions on Neural Networks 12:4, 755-764. [CrossRef] 239. R. Gencay, Min Qi. 2001. Pricing and hedging derivative securities with neural networks: Bayesian regularization, early stopping, and bagging. IEEE Transactions on Neural Networks 12:4, 726-734. [CrossRef] 240. T. Van Gestel, J.A.K. Suykens, D.-E. Baestaens, A. Lambrechts, G. Lanckriet, B. Vandaele, B. De Moor, J. Vandewalle. 2001. Financial time series prediction using least squares support vector machines within the evidence framework. IEEE Transactions on Neural Networks 12:4, 809-821. [CrossRef] 241. Paulin Coulibaly, Bernard Bob�e, Fran�ois Anctil. 2001. Improving extreme hydrologic events forecasting using a new criterion for artificial neural network selection. Hydrological Processes 15:8, 1533-1536. [CrossRef] 242. Sumio Watanabe . 2001. Algebraic Analysis for Nonidentifiable Learning MachinesAlgebraic Analysis for Nonidentifiable Learning Machines. Neural Computation 13:4, 899-933. [Abstract] [PDF] [PDF Plus] 243. J Svensson, M von Hellermann, R W T König. 2001. Plasma Physics and Controlled Fusion 43:4, 389-403. [CrossRef] 244. Adam G Polak. 2001. Measurement Science and Technology 12:3, 278-287. [CrossRef] 245. G. Torheim, F. Godtliebsen, D. Axelson, K.A. Kvistad, A. Haraldseth, P.A. Rinck. 2001. Feature extraction and classification of dynamic contrast-enhanced T2*-weighted breast image data. IEEE Transactions on Medical Imaging 20:12, 1293-1301. [CrossRef] 246. S. Chen, S.R. Gunn, C.J. Harris. 2001. The relevance vector machine technique for channel equalization application. IEEE Transactions on Neural Networks 12:6, 1529-1532. [CrossRef] 247. Junita Mohamad-Saleh, Brian S. Hoyle, Frank J. W. Podd, D. Mark Spink. 2001. Direct process estimation from tomographic data using artificial neural systems. Journal of Electronic Imaging 10:3, 646. [CrossRef]
248. Chi-Sing Leung, Ah-Chung Tsoi, Lai Wan Chan. 2001. Two regularizers for recursive least squared algorithms in feedforward multilayered neural networks. IEEE Transactions on Neural Networks 12:6, 1314-1332. [CrossRef] 249. S. Watanabe. 2001. Learning efficiency of redundant neural networks in Bayesian estimation. IEEE Transactions on Neural Networks 12:6, 1475-1486. [CrossRef] 250. John H. Xin, Sijie Shao, Korris Fu-lai Chung. 2000. Colour-appearance modeling using feedforward networks with Bayesian regularization method? Part I: Forward model. Color Research & Application 25:6, 424-434. [CrossRef] 251. Manfred Opper , Ole Winther . 2000. Gaussian Processes for Classification: Mean-Field AlgorithmsGaussian Processes for Classification: Mean-Field Algorithms. Neural Computation 12:11, 2655-2684. [Abstract] [PDF] [PDF Plus] 252. Dirk Husmeier . 2000. The Bayesian Evidence Scheme for Regularizing Probability-Density Estimating Neural NetworksThe Bayesian Evidence Scheme for Regularizing Probability-Density Estimating Neural Networks. Neural Computation 12:11, 2685-2717. [Abstract] [PDF] [PDF Plus] 253. T J Sabin, C A L Bailer-Jones, P J Withers. 2000. Modelling and Simulation in Materials Science and Engineering 8:5, 687-706. [CrossRef] 254. J. Mateos, A.K. Katsaggelos, R. Molina. 2000. A Bayesian approach for the estimation and transmission of regularization parameters for reducing blocking artifacts. IEEE Transactions on Image Processing 9:7, 1200-1215. [CrossRef] 255. K. Wong, Hidetoshi Nishimori. 2000. Error-correcting codes and image restoration with multiple stages of dynamics. Physical Review E 62:1, 179-190. [CrossRef] 256. O. Lahav, S. L. Bridle, M. P. Hobson, A. N. Lasenby, L. Sodre. 2000. Bayesian 'hyper-parameters' approach to joint estimation: the Hubble constant from CMB measurements. Monthly Notices of the Royal Astronomical Society 315:4, L45-L49. [CrossRef] 257. Sumio Watanabe. 2000. On the generalization error by a layered statistical model with Bayesian estimation. Electronics and Communications in Japan (Part III: Fundamental Electronic Science) 83:6, 95-106. [CrossRef] 258. J. F. G. de Freitas , M. Niranjan , A. H. Gee , A. Doucet . 2000. Sequential Monte Carlo Methods to Train Neural Network ModelsSequential Monte Carlo Methods to Train Neural Network Models. Neural Computation 12:4, 955-993. [Abstract] [PDF] [PDF Plus] 259. Kazumi Saito , Ryohei Nakano . 2000. Second-Order Learning Algorithm with Squared Penalty TermSecond-Order Learning Algorithm with Squared Penalty Term. Neural Computation 12:3, 709-729. [Abstract] [PDF] [PDF Plus] 260. J. Heald, J. Stark. 2000. Estimation of Noise Levels for Models of Chaotic Dynamical Systems. Physical Review Letters 84:11, 2366-2369. [CrossRef]
261. M S Hazell, R Jones, P E Luffman, R F Sewell. 2000. Measurement Science and Technology 11:3, 227-236. [CrossRef] 262. R. Fischer, K. Hanson, V. Dose, W. von der Linden. 2000. Background estimation in experimental spectra. Physical Review E 61:2, 1152-1160. [CrossRef] 263. Alberto Malinverno. 2000. A Bayesian criterion for simplicity in inverse problem parametrization. Geophysical Journal International 140:2, 267-285. [CrossRef] 264. C.C. Holmes, B.K. Mallick. 2000. Bayesian wavelet networks for nonparametric regression. IEEE Transactions on Neural Networks 11:1, 27-35. [CrossRef] 265. Mirko van der Baan, Christian Jutten. 2000. Neural networks in geophysical applications. Geophysics 65:4, 1032. [CrossRef] 266. A. Veiga, M.C. Medeiros. 2000. A hybrid linear-neural model for time series forecasting. IEEE Transactions on Neural Networks 11:6, 1402-1412. [CrossRef] 267. G.P. Zhang. 2000. Neural networks for classification: a survey. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 30:4, 451-462. [CrossRef] 268. James Tin-Yau Kwok. 2000. The evidence framework applied to support vector machines. IEEE Transactions on Neural Networks 11:5, 1162-1173. [CrossRef] 269. Rudolf Kulhavý, Petya Ivanova. 1999. Quo vadis, Bayesian identification?. International Journal of Adaptive Control and Signal Processing 13:6, 469-485. [CrossRef] 270. David J. C. MacKay . 1999. Comparison of Approximate Methods for Handling HyperparametersComparison of Approximate Methods for Handling Hyperparameters. Neural Computation 11:5, 1035-1068. [Abstract] [PDF] [PDF Plus] 271. Yukito Iba. 1999. Journal of Physics A: Mathematical and General 32:21, 3875-3888. [CrossRef] 272. H. Attias . 1999. Independent Factor AnalysisIndependent Factor Analysis. Neural Computation 11:4, 803-851. [Abstract] [PDF] [PDF Plus] 273. J. Clark, K. Gernoth, S. Dittmar, M. Ristig. 1999. Higher-order probabilistic perceptrons as Bayesian inference engines. Physical Review E 59:5, 6161-6174. [CrossRef] 274. Bernhard Schottky, David Saad. 1999. Journal of Physics A: Mathematical and General 32:9, 1605-1621. [CrossRef] 275. G. P. Liu, V. Kadirkamanathan. 1999. Multiobjective criteria for neural network structure selection and identification of nonlinear systems using genetic algorithms. IEE Proceedings - Control Theory and Applications 146:5, 373. [CrossRef]
276. Arun Tholudur, W. Fred Ramirez, James D. McMillan. 1999. Mathematical modeling and optimization of cellulase protein production usingTrichoderma reesei RL-P37. Biotechnology and Bioengineering 66:1, 1-16. [CrossRef] 277. R. Molina, A.K. Katsaggelos, J. Mateos. 1999. Bayesian and regularization methods for hyperparameter estimation in image restoration. IEEE Transactions on Image Processing 8:2, 231-246. [CrossRef] 278. S. Chen, Y. Wu, B.L. Luk. 1999. Combined genetic algorithm optimization and regularized orthogonal least squares learning for radial basis function networks. IEEE Transactions on Neural Networks 10:5, 1239-1243. [CrossRef] 279. J.T.-Y. Kwok. 1999. Moderating the outputs of support vector machine classifiers. IEEE Transactions on Neural Networks 10:5, 1018-1031. [CrossRef] 280. W.A. Wright. 1999. Bayesian approach to neural-network modeling with input uncertainty. IEEE Transactions on Neural Networks 10:6, 1261-1270. [CrossRef] 281. G.P. Liu, V. Kadirkamanathan, S.A. Billings. 1999. Variable neural networks for adaptive control of nonlinear systems. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 29:1, 34-43. [CrossRef] 282. Gokaraju K. Raju, Charles L. Cooney. 1998. Active learning from process data. AIChE Journal 44:10, 2199-2211. [CrossRef] 283. C. C. Holmes , B. K. Mallick . 1998. Bayesian Radial Basis Functions of Variable DimensionBayesian Radial Basis Functions of Variable Dimension. Neural Computation 10:5, 1217-1233. [Abstract] [PDF] [PDF Plus] 284. Siegfried Bös. 1998. Statistical mechanics approach to early stopping and weight decay. Physical Review E 58:1, 833-844. [CrossRef] 285. Siegfried Bös, Manfred Opper. 1998. Journal of Physics A: Mathematical and General 31:21, 4835-4850. [CrossRef] 286. G Dirscherl, B Schottky, U Krey. 1998. Journal of Physics A: Mathematical and General 31:11, 2519-2540. [CrossRef] 287. P. Magni, R. Bellazzi, G. De Nicolao. 1998. Bayesian function learning using MCMC methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 20:12, 1319-1331. [CrossRef] 288. Frederico V. Prudente, Paulo H. Acioli, J. J. Soares Neto. 1998. The fitting of potential energy surfaces using neural networks: Application to the study of vibrational levels of H[sub 3][sup +]. The Journal of Chemical Physics 109:20, 8801. [CrossRef] 289. Kwok-Wai Cheung, Dit-Yan Yeung, R.T. Chin. 1998. A Bayesian framework for deformable pattern recognition with application to handwritten character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 20:12, 1382-1388. [CrossRef] 290. Jason A. S. Freeman , David Saad . 1997. Online Learning in Radial Basis Function NetworksOnline Learning in Radial Basis Function Networks. Neural Computation 9:7, 1601-1622. [Abstract] [PDF] [PDF Plus]
291. Jason Freeman, David Saad. 1997. Dynamics of on-line learning in radial basis function networks. Physical Review E 56:1, 907-918. [CrossRef] 292. Haiwen Ye, Weidou Ni. 1997. Static and transient performance prediction for CFB boilers using a Bayesian-Gaussian Neural Network. Journal of Thermal Science 6:2, 141-148. [CrossRef] 293. Tin-Yau Kwok, Dit-Yan Yeung. 1997. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks 8:3, 630-645. [CrossRef] 294. Padhraic Smyth , David Heckerman , Michael I. Jordan . 1997. Probabilistic Independence Networks for Hidden Markov Probability ModelsProbabilistic Independence Networks for Hidden Markov Probability Models. Neural Computation 9:2, 227-269. [Abstract] [PDF] [PDF Plus] 295. Vijay Balasubramanian . 1997. Statistical Inference, Occam's Razor, and Statistical Mechanics on the Space of Probability DistributionsStatistical Inference, Occam's Razor, and Statistical Mechanics on the Space of Probability Distributions. Neural Computation 9:2, 349-368. [Abstract] [PDF] [PDF Plus] 296. Sepp Hochreiter, Jürgen Schmidhuber. 1997. Flat MinimaFlat Minima. Neural Computation 9:1, 1-42. [Abstract] [PDF] [PDF Plus] 297. Anders Krogh, Peter Sollich. 1997. Statistical mechanics of ensemble learning. Physical Review E 55:1, 811-825. [CrossRef] 298. Tin-Yan Kwok, Dit-Yan Yeung. 1997. Objective functions for training new hidden units in constructive neural networks. IEEE Transactions on Neural Networks 8:5, 1131-1148. [CrossRef] 299. Glenn Marion, David Saad. 1996. Journal of Physics A: Mathematical and General 29:17, 5387-5404. [CrossRef] 300. Huaiyu Zhu, Richard Rohwer. 1996. Bayesian regression filters and the issue of priors. Neural Computing & Applications 4:3, 130-142. [CrossRef] 301. R.R. Schultz, R.L. Stevenson. 1996. Extraction of high-resolution frames from video sequences. IEEE Transactions on Image Processing 5:6, 996-1011. [CrossRef] 302. M. Revow, C.K.I. Williams, G.E. Hinton. 1996. Using generative models for handwritten digit recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 18:6, 592-606. [CrossRef] 303. Richard Rohwer , Michał Morciniec . 1996. A Theoretical and Experimental Account of n-Tuple Classifier PerformanceA Theoretical and Experimental Account of n-Tuple Classifier Performance. Neural Computation 8:3, 629-642. [Abstract] [PDF] [PDF Plus] 304. Guozhong An . 1996. The Effects of Adding Noise During Backpropagation Training on a Generalization PerformanceThe Effects of Adding Noise During Backpropagation Training on a Generalization Performance. Neural Computation 8:3, 643-674. [Abstract] [PDF] [PDF Plus]
305. Richard Rohwer , John C. van der Rest . 1996. Minimum Description Length, Regularization, and Multimodal DataMinimum Description Length, Regularization, and Multimodal Data. Neural Computation 8:3, 595-609. [Abstract] [PDF] [PDF Plus] 306. David F. R. Brown, Mark N. Gibbs, David C. Clary. 1996. Combining ab initio computations, neural networks, and diffusion Monte Carlo: An efficient method to treat weakly bound molecules. The Journal of Chemical Physics 105:17, 7597. [CrossRef] 307. H.H. Thodberg. 1996. A review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks 7:1, 56-72. [CrossRef] 308. Huaiyu Zhu, Richard Rohwer. 1995. Bayesian invariant measurements of generalization. Neural Processing Letters 2:6, 28-31. [CrossRef] 309. Gustavo Deco, Bernd Schürmann. 1995. Statistical-ensemble theory of redundancy reduction and the duality between unsupervised and supervised neural learning. Physical Review E 52:6, 6580-6587. [CrossRef] 310. J. A. S. Freeman , D. Saad . 1995. Learning and Generalization in Radial Basis Function NetworksLearning and Generalization in Radial Basis Function Networks. Neural Computation 7:5, 1000-1020. [Abstract] [PDF] [PDF Plus] 311. Gustavo Deco, Dragan Obradovic. 1995. Statistical physics theory of query learning by an ensemble of higher-order neural networks. Physical Review E 52:2, 1953-1957. [CrossRef] 312. G Marion, D Saad. 1995. Journal of Physics A: Mathematical and General 28:8, 2159-2171. [CrossRef] 313. J M Pryce, A D Bruce. 1995. Journal of Physics A: Mathematical and General 28:3, 511-532. [CrossRef] 314. Lars Kai Hansen , Carl Edward Rasmussen . 1994. Pruning from Adaptive RegularizationPruning from Adaptive Regularization. Neural Computation 6:6, 1223-1232. [Abstract] [PDF] [PDF Plus] 315. Michael S. Lewicki . 1994. Bayesian Modeling and Classification of Neural SignalsBayesian Modeling and Classification of Neural Signals. Neural Computation 6:5, 1005-1030. [Abstract] [PDF] [PDF Plus] 316. A D Bruce, D Saad. 1994. Journal of Physics A: Mathematical and General 27:10, 3355-3363. [CrossRef] 317. Nenad Ivezic, James H. Garrett. 1994. A neural network-based machine learning approach for supporting synthesis. Artificial Intelligence for Engineering, Design, Analysis and Manufacturing 8:02, 143. [CrossRef] 318. Chris M. Bishop. 1994. Neural networks and their applications. Review of Scientific Instruments 65:6, 1803. [CrossRef] 319. Visakan Kadirkamanathan , Mahesan Niranjan . 1993. A Function Estimation Approach to Sequential Learning with Neural NetworksA Function Estimation
Approach to Sequential Learning with Neural Networks. Neural Computation 5:6, 954-975. [Abstract] [PDF] [PDF Plus] 320. David J. C. MacKay . 1992. The Evidence Framework Applied to Classification NetworksThe Evidence Framework Applied to Classification Networks. Neural Computation 4:5, 720-736. [Abstract] [PDF] [PDF Plus]
Communicated by David Haussler
A Practical Bayesian Framework for Backpropagation Networks

David J. C. MacKay* Computation and Neural Systems, California Institute of Technology 139-74, Pasadena, CA 91125 USA

*Present address: Darwin College, Cambridge CB3 9EU, U.K.

Neural Computation 4, 448-472 (1992) © 1992 Massachusetts Institute of Technology
A quantitative and practical Bayesian framework is described for learning of mappings in feedforward networks. The framework makes possible (1) objective comparisons between solutions using alternative network architectures, (2) objective stopping rules for network pruning or growing procedures, (3) objective choice of magnitude and type of weight decay terms or additive regularizers (for penalizing large weights, etc.), (4) a measure of the effective number of well-determined parameters in a model, (5) quantified estimates of the error bars on network parameters and on network output, and (6) objective comparisons with alternative learning and interpolation models such as splines and radial basis functions. The Bayesian "evidence" automatically embodies "Occam's razor," penalizing overflexible and overcomplex models. The Bayesian approach helps detect poor underlying assumptions in learning models. For learning models well matched to a problem, a good correlation between generalization ability and the Bayesian evidence is obtained.
This paper makes use of the Bayesian framework for regularization and model comparison described in the companion paper "Bayesian Interpolation" (MacKay 1992a). This framework is due to Gull and Skilling (Gull 1989).

1 The Gaps in Backprop

There are many knobs on the black box of "backprop" [learning by backpropagation of errors (Rumelhart et al. 1986)]. Generally these knobs are set by rules of thumb, trial and error, and the use of reserved test data to assess generalization ability (or by more sophisticated cross-validation). The knobs fall into two classes: (1) parameters that change the effective learning model, for example, number of hidden units, and weight decay
terms; and (2) parameters concerned with function optimization technique, for example, "momentum" terms. This paper is concerned with making objective the choice of the parameters in the first class, and with ranking alternative solutions to a learning problem in a way that makes full use of all the available data. Bayesian techniques will be described that are both theoretically well-founded and practically implementable.

Let us review the basic framework for learning in networks, then discuss the points at which objective techniques are needed. The training set for the mapping to be learned is a set of input-target pairs D = {x^m, t^m}, where m is a label running over the pairs. A neural network architecture A is invented, consisting of a specification of the number of layers, the number of units in each layer, the type of activation function performed by each unit, and the available connections between the units. If a set of values w is assigned to the connections in the network, the network defines a mapping y(x; w, A) from the input activities x to the output activities y.¹ The distance of this mapping to the training set is measured by some error function; for example, the error for the entire data set is commonly taken to be

    E_D(D \mid w, A) = \sum_m \frac{1}{2} [y(x^m; w, A) - t^m]^2    (1.1)
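To make the objects in equation 1.1 concrete, here is a minimal numerical sketch (in NumPy; the one-hidden-layer tanh architecture, the toy data, and all names are illustrative assumptions, not specifics from this paper) of a mapping y(x; w, A) and its data error E_D:

    import numpy as np

    rng = np.random.default_rng(0)

    def y(x, W1, b1, W2, b2):
        # Illustrative architecture A: one tanh hidden layer, linear output.
        return np.tanh(x @ W1 + b1) @ W2 + b2

    # Toy training set D = {x^m, t^m}, m = 1..20.
    X = rng.normal(size=(20, 3))
    T = rng.normal(size=(20, 1))

    # One assignment w of values to the connections.
    W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
    W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

    # Equation 1.1: E_D is the sum over pairs of half the squared residual.
    E_D = 0.5 * np.sum((y(X, W1, b1, W2, b2) - T) ** 2)
    print(E_D)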
The task of "learning" is to find a set of connections w that gives a mapping that fits the training set well, that is, has small error E_D; it is also hoped that the learned connections will "generalize" well to new examples. Plain backpropagation learns by performing gradient descent on E_D in w-space. Modifications include the addition of a "momentum" term, and the inclusion of noise in the descent process. More efficient optimization techniques may also be used, such as conjugate gradients or variable metric methods. This paper will not discuss computational modifications concerned only with speeding the optimization. It will address, however, those modifications to the plain backprop algorithm that implicitly or explicitly modify the objective function, with decay terms or regularizers. It is moderately common for extra regularizing terms E_W(w) to be added to E_D; for example, terms that penalize large weights may be introduced, in the hope of achieving a smoother or simpler mapping (Hinton and Sejnowski 1986; Ji et al. 1990; Nowlan 1991; Rumelhart 1987; Weigend et al. 1991). Some of the "hints" in Abu-Mostafa (1990b) also fall into the category of additive weight-dependent energies. A sample weight energy term is

    E_W(w \mid A) = \sum_i \frac{1}{2} w_i^2    (1.2)
¹The framework developed in this paper will apply not only to networks composed of "neurons," but to any regression model for which we can compute the derivatives of the outputs with respect to the parameters, ∂y(x; w, A)/∂w.
The weight energy may be implicit; for example, "weight decay" (subtraction of a multiple of w in the weight change rule) corresponds to the energy in equation 1.2. Gradient-based optimization is then used to minimize the combined function:
    M = \alpha E_W(w \mid A) + \beta E_D(D \mid w, A)    (1.3)

where α and β are "black box" parameters. The constant α should not be confused with the "momentum" parameter sometimes introduced into backprop; in the present context α is a decay rate or regularizing constant. Also note that α should not be viewed as causing "forgetting"; E_D is defined as the error on the entire data set, so gradient descent on M treats all data points equally, irrespective of the order in which they were acquired.
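As a sketch of how equation 1.3 drives learning (a one-layer linear model stands in for the network so the gradients stay analytic, as the framework permits; the data and the values of α, β, and the step size are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 4))
    t = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)

    alpha, beta, eta = 0.1, 10.0, 0.001  # regularizing constant, noise level, step size
    w = np.zeros(4)
    for _ in range(2000):
        grad_ED = X.T @ (X @ w - t)      # gradient of E_D = sum_m (1/2)(x^m . w - t^m)^2
        grad_EW = w                      # gradient of E_W = sum_i (1/2) w_i^2
        # Gradient descent on M = alpha*E_W + beta*E_D.  The -eta*alpha*w piece
        # is exactly "weight decay": subtraction of a multiple of w at each step.
        w = w - eta * (alpha * grad_EW + beta * grad_ED)

    print(w)  # close to the regularized least-squares solution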
1.1 What Is Lacking

The above procedures include a host of free parameters, such as the choice of neural network architecture and the value of the regularizing constant α. There are not yet established ways of objectively setting these parameters, though there are many rules of thumb (see Ji et al. 1990; Weigend et al. 1991, for examples).

One popular way of comparing networks trained with different parameter values is to assess their performance by measuring the error on an unseen test set or by similar cross-validation techniques. The data are divided into two sets: a training set that is used to optimize the parameters w of the network, and a test set that is used to optimize control parameters such as α and the architecture A. However, the utility of these techniques for determining values of the parameters α and β, or for comparing alternative network solutions, is limited, because a large test set may be needed to reduce the signal-to-noise ratio in the test error, and cross-validation is computationally demanding. Furthermore, if there are several parameters such as α and β, it is out of the question to optimize such parameters by repeating the learning with all possible values of these parameters and using a test set; such parameters must be optimized on line. It is, therefore, interesting to study objective criteria for setting free parameters and comparing alternative solutions that depend only on the data set used for the training. Such criteria will prove especially important in applications where the total amount of data is limited, so that one does not want to sacrifice good data for use as a test set. Rather, we wish to find a way to use all our data in the process of optimizing the parameters w and in the process of optimizing control parameters such as α and A.

This paper will describe practical Bayesian methods for filling the following holes in the neural network framework just described:

1. Objective criteria for comparing alternative neural network solutions, in particular with different architectures A. Given a single architecture
A, there may be more than one minimum of the objective function M. If there is a large disparity in M between the minima, then it is plausible to choose the solution with smallest M. But where the difference is not so great it is desirable to be able to assign an objective preference to the alternatives. It is also desirable to be able to assign preferences to neural network solutions using different numbers of hidden units, and different activation functions. Here there is an "Occam's razor" problem: the more free parameters a model has, the smaller the data error E_D it can achieve. So we cannot simply choose the architecture with smallest data error; that would lead us to an overcomplex network that generalizes poorly. The use of weight decay does not fully alleviate this problem; networks with too many hidden units still generalize worse, even if weight decay is used (see Section 4).

2. Objective criteria for setting the decay rate α. As in the choice of A above, there is an "Occam's razor" problem: a small value of α in equation 1.3 allows the weights to become large and overfit the noise in the data. This leads to a small value of the data error E_D (and a small value of M), so we cannot base our choice of α only on E_D or M. The Bayesian solution presented here can be implemented on-line; that is, it is not necessary to do multiple learning runs with different values of α in order to find the best.

3. Objective choice of the regularizing function E_W.

4. Objective criteria for choosing between a neural network solution and a
solution using a different learning or interpolation model, for example, splines or radial basis functions.

1.2 The Probability Connection. Tishby et al. (1989) introduced a probabilistic view of learning that is an important step toward solving the problems listed above. The idea is to force a probabilistic interpretation onto the neural network technique so as to be able to make objective statements. This interpretation does not involve the addition of any new arbitrary functions or parameters; rather, it assigns a meaning to the functions and parameters that are already used. My work is based on the same probabilistic framework and extends it using concepts and techniques adapted from Gull and Skilling's Bayesian image reconstruction methods (Gull 1989). This paper also adopts a shift in emphasis from Tishby et al.'s paper. Their work concentrated on predicting the average generalization ability of one network trained on a task drawn from a known prior ensemble of tasks. This is called forward probability. In this paper the emphasis will be on quantifying the relative plausibilities of many alternative solutions to an interpolation or classification task; that task is defined by a single data set produced by the real world, and we do not know the prior ensemble from which the
task comes. This is called inverse probability. This paper avoids the language of statistical physics, in order to maintain wider readability and to avoid concepts that would sound strange in that language; for example, "the probability distribution of the temperature" is unfamiliar in physics, but "the probability distribution of the noise variance" is its innocent counterpart in literal terms.

Let us now review the probabilistic interpretation of network learning.
Likelihood. A network with specified architecture A and connections w is viewed as making predictions about the target output t as a function of input x in accordance with the probability distribution

P(t | x, w, β, A) = exp(−βE(t; x, w, A)) / Z_m(β)   (1.4)
where Z_m(β) = ∫dt exp(−βE). Here E is the error for a single datum, and β is a measure of the presumed noise included in t. If E is the quadratic error function, then this corresponds to the assumption that t includes additive gaussian noise with variance σ_ν² = 1/β.
Prior. A prior probability is assigned to alternative network connection strengths w, written in the form

P(w | α, A, R) = exp(−αE_W) / Z_W(α)   (1.5)
where Z_W(α) = ∫d^k w exp(−αE_W). Here α is a measure of the characteristic expected connection magnitude. If E_W is quadratic, as specified in equation 1.2, then the weights are expected to come from a gaussian with zero mean and variance σ_W² = 1/α. Alternative "regularizers" R (each using a different energy function E_W) implicitly correspond to alternative hypotheses about the statistics of the environment.
Posterior. The posterior probability of the network connections w is then

P(w | D, α, β, A, R) = exp(−αE_W − βE_D) / Z_M(α, β)   (1.6)
where Z_M(α, β) = ∫d^k w exp(−αE_W − βE_D). Notice that the exponent in this expression is the same as (minus) the objective function M defined in equation 1.3.
So under this framework, minimization of M = αE_W + βE_D is identical to finding the (locally) most probable parameters w_MP; minimization of E_D alone is identical to finding the maximum likelihood parameters w_ML. Thus an interpretation has been given to backpropagation's energy functions E_D and E_W, and to the parameters α and β.
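To make this correspondence concrete, here is a minimal sketch (a toy example of mine, not code from the paper): a one-parameter linear model with quadratic E_D and E_W, for which the minimum of M = αE_W + βE_D can be written in closed form and compared with the maximum likelihood fit.

```python
import numpy as np

# Toy illustration: fit t ~ w*x by minimizing M = alpha*E_W + beta*E_D,
# with E_D = sum(t - w*x)^2 / 2 and E_W = w^2 / 2. All data are synthetic.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
t = 0.7 * x + rng.normal(0.0, 0.1, size=20)   # true weight 0.7, noise sd 0.1

alpha = 1.0                # prior: w ~ N(0, 1/alpha)
beta = 1.0 / 0.1 ** 2      # noise level: beta = 1/sigma_nu^2

# Setting dM/dw = 0 gives the most probable (MAP) weight in closed form:
w_mp = beta * np.dot(x, t) / (alpha + beta * np.dot(x, x))
w_ml = np.dot(x, t) / np.dot(x, x)            # maximum likelihood weight
print(w_mp, w_ml)   # w_MP is shrunk toward zero relative to w_ML
```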
It should be emphasized that "the probability of the connections w" is a measure of the plausibility that the model's parameters should have a specified value w; this has nothing to do with the probability that a particular algorithm might converge to w.

This framework offers some partial enhancements for backprop methods. The work of Levin et al. (1989) makes it possible to predict the average generalization ability of neural networks trained on one of a defined class of problems; however, it is not clear whether this will lead to a practical technique for choosing between alternative network architectures for real data sets. Le Cun et al. (1990) have demonstrated how to estimate the "saliency" of a weight, which is the change in M when the weight is deleted, and have used this measure successfully to simplify large neural networks; however, no stopping rule for weight deletion was offered other than measuring performance on a test set. Also, Denker and Le Cun (1991) demonstrated how the Hessian of M can be used to assign error bars to the parameters of a network and to its outputs; however, these error bars can be quantified only once β is quantified, and how to do this without prior knowledge or extra data had not been demonstrated. In fact β can be estimated from the training data alone.

2 Review of Bayesian Regularization and Model Comparison
In the companion paper (MacKay 1992a) it was demonstrated how the control parameters α and β are assigned by Bayes, and how alternative interpolation models can be compared. It was noted there that it is not satisfactory to optimize α and β by finding the joint maximum likelihood value of w, α, β; the likelihood has a skew peak whose maximum is not located at the most probable values of the control parameters. MacKay (1992a) also reviewed how the Bayesian choice of α and β is neatly expressed in terms of a measure of the number of well-determined parameters in a model, γ. However, that paper assumed that M(w) has only one significant minimum, which was well approximated as quadratic. [All the interpolation models discussed in MacKay (1992a) can be interpreted as two-layer networks with a fixed nonlinear first layer and an adaptive linear second layer.] In this section I briefly review the Bayesian framework, retaining that assumption. The following section will then discuss how the framework can be modified to handle neural networks, where the landscape of M(w) is certainly not quadratic.

2.1 Determination of α and β. By Bayes' rule, the posterior probability for these parameters is

P(α, β | D, A, R) = P(D | α, β, A, R) P(α, β) / P(D | A, R)   (2.1)
Now if we assign a uniform prior to (α, β), the quantity of interest for assigning preferences to (α, β) is the first term on the right-hand side, the evidence for α, β, which can be written as²

P(D | α, β, A, R) = Z_M(α, β) / (Z_W(α) Z_D(β))   (2.2)
where Z_M and Z_W were defined earlier, and Z_D = ∫d^N D exp(−βE_D).

Let us use the simple quadratic energy functions defined in equations 1.1 and 1.2. This makes the analysis easier, but more complex cases can still in principle be handled by the same approach. Let the number of degrees of freedom in the data set, that is, the number of output units times the number of data pairs, be N, and let the number of free parameters, that is, the dimension of w, be k. Then we can immediately evaluate the gaussian integrals Z_D and Z_W: Z_D = (2π/β)^{N/2} and Z_W = (2π/α)^{k/2}. Now we want to find Z_M(α, β) = ∫d^k w exp[−M(w, α, β)]. Supposing for now that M has a single minimum as a function of w, at w_MP, and assuming we can locally approximate M as quadratic there, the integral Z_M is approximated by

Z_M ≈ exp(−M(w_MP)) (2π)^{k/2} det^{−1/2} A   (2.3)
where A = ∇∇M is the Hessian of M evaluated at w_MP.

The maximum of P(D | α, β, A, R) has the following useful properties:

2αE_W = γ   (2.4)

2βE_D = N − γ   (2.5)

where γ is the effective number of parameters determined by the data,

γ = Σ_i λ_i / (λ_i + α)   (2.6)

where the λ_i are the eigenvalues of the quadratic form βE_D in the natural basis of E_W.

2.2 Comparison of Different Models. To rank alternative architectures A and penalty functions E_W in the light of the data, we simply evaluate the evidence, P(D | A, R), which appeared as the normalizing constant in equation 2.1. Integrating the evidence for (α, β), we have:
P(D | A, R) = ∫∫ P(D | α, β, A, R) P(α, β) dα dβ   (2.7)
The evidence is the Bayesian's transportable quantity for comparing models in the light of the data.

²The same notation, and the same abuses thereof, will be used as in MacKay (1992a).
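As an aside, the conditions in equations 2.4 and 2.5 suggest a simple fixed-point iteration once the eigenvalues of the data-error Hessian are available. The sketch below is my own illustration with made-up numbers, not code from the paper; for simplicity it holds E_W and E_D fixed, whereas in a real run they would be re-evaluated at the new w_MP after each update.

```python
import numpy as np

def gamma(eigvals_ED, alpha, beta):
    """Equation 2.6: effective number of well-determined parameters.
    eigvals_ED are eigenvalues of grad-grad E_D, so lambda_i = beta * eig."""
    lam = beta * eigvals_ED
    return np.sum(lam / (lam + alpha))

def reestimate(eigvals_ED, E_W, E_D, N, alpha=1.0, beta=1.0, iters=50):
    """Fixed-point iteration for the most probable (alpha, beta).
    E_W and E_D are treated as constants here, for illustration only."""
    for _ in range(iters):
        g = gamma(eigvals_ED, alpha, beta)
        alpha = g / (2.0 * E_W)          # equation 2.4
        beta = (N - g) / (2.0 * E_D)     # equation 2.5
    return alpha, beta

# made-up eigenvalues and energies for a 10-parameter model with N = 100
eigs = np.linspace(0.1, 5.0, 10)
print(reestimate(eigs, E_W=2.0, E_D=40.0, N=100))
```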
3 Adapting the Framework
For neural networks, M(w) is not quadratic. Indeed, it is well known that M typically has many local minima. And if the network has a symmetry under permutation of its parameters, then we know that M(w) must share that symmetry, so that every single minimum belongs to a family of symmetric minima of M. For example, if there are H hidden units in a single layer, then each nondegenerate minimum is in a family of size g = H! 2^H. Now it may be the case that the significant minima of M are locally quadratic, so we might be able to evaluate Z_M by evaluating equation 2.3 at each significant minimum and adding up the Z_M's; but the number of those minima is unknown, and this approach to evaluating Z_M would seem dubious.

Luckily, however, we do not actually want to evaluate Z_M. We would need to evaluate Z_M in order to assign a posterior probability over α, β for an entire model, and to evaluate the evidence for alternative entire models. This is not quite what we wish to do: when we use a neural network to perform a mapping, we typically implement only one neural network at a time, and this network will have its parameters set to a particular solution of the learning problem. Therefore, the alternatives we wish to rank are the different solutions of the learning problem, that is, the different minima of M. We would want the evidence as a function of the number of hidden units only if we were somehow able to simultaneously implement the entire posterior ensemble of networks for one number of hidden units. Similarly, we do not want the posterior over α, β for the entire posterior ensemble; rather, it is reasonable to allow each solution (each minimum of M) to choose its own optimal value for these parameters. The same method of chopping up a complex model space is used in the unsupervised classification system AutoClass (Hanson et al. 1991).

Having adopted this slight shift in objective, it turns out that to set α and β and to compare alternative solutions to a learning problem, the integral we now need to evaluate is a local version of Z_M. Assume that the posterior probability consists of well-separated islands in parameter space, each centered on a minimum of M. We wish to evaluate how much posterior probability mass is in each of these islands. Consider a minimum located at w*, and define a solution S_{w*} as the ensemble of networks in the neighborhood of w*, together with all symmetric permutations of that ensemble. Let us evaluate the posterior probability of alternative solutions S_{w*} and the parameters α and β:

P(S_{w*}, α, β | D) ∝ g Z_M*(w*; α, β) P(α, β) / (Z_W(α) Z_D(β))   (3.1)

where g is the permutation factor, and

Z_M*(w*; α, β) = ∫_{near w*} d^k w exp[−M(w; α, β)]
where the integral is performed only over the neighborhood of the minimum at w*. I will refer to the quantity g[Z_M*(w*; α, β)/Z_W(α)Z_D(β)] as the evidence for α, β, S_{w*}. The parameters α and β will be chosen to maximize this evidence. Then the quantity we want to evaluate to compare alternative solutions is the evidence³ for S_{w*},

∫∫ g [Z_M*(w*; α, β) / (Z_W(α) Z_D(β))] P(α, β) dα dβ   (3.2)
This paper uses the gaussian approximation for Z_M*:

Z_M* ≈ exp(−M(w*)) (2π)^{k/2} det^{−1/2} A   (3.3)
where A = ∇∇M is the Hessian of M evaluated at w*. For general α and β this approximation is probably unacceptable; however, we need it to be accurate only for the small range of α and β close to their most probable values. The regime in which this approximation will definitely break down is when the number of constraints, N, is small relative to the number of free parameters, k. For large N/k the central limit theorem encourages us to use the gaussian approximation (Walker 1967). It is a matter for further research to establish how large N/k must be for this approximation to be reliable.

What obstacles remain to prevent us from evaluating the local Z_M*? We need to evaluate or approximate the inverse Hessian of M, and we need to evaluate or approximate its determinant and/or trace (MacKay 1992a). Denker and Le Cun (1991) and Le Cun et al. (1990) have already discussed how to approximate the Hessian of E_D for the purpose of evaluating weight saliency and for assigning error bars to weights and network outputs. The Hessian can be evaluated in the same way that backpropagation evaluates ∇E_D (see Bishop 1992 for a complete algorithm and the appendix of this paper for a useful approximation). Alternatively, A can be evaluated by numerical methods, for example second differences. A third option: if variable metric methods are used to minimize M instead of gradient descent, then the inverse Hessian is automatically generated during the search for the minimum. It is important for the success of this Bayesian method that the off-diagonal terms of the Hessian be evaluated. Denker et al.'s method can do this without any additional complexity; the diagonal approximation is no good because of the strong posterior correlations in the parameters.

³Bayesian model comparison is performed by evaluating and comparing the evidence for alternative models. Gull and Skilling defined the evidence for a model H to be P(D | H). The existence of multiple minima in neural network parameter space complicates model comparison. The quantity in equation 3.2 is not P(D | S_{w*}, A, R) (it includes the prior for S_{w*} | A, R), but I have called it the evidence because it is the quantity we should evaluate to compare alternative solutions with each other and with other models.
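Given the Hessian, the gaussian approximation of equation 3.3 makes the log evidence for a single solution a few lines of linear algebra. The following sketch is my own rendering (the function name and argument choices are mine, not the paper's); it assembles log g + log Z_M* − log Z_W − log Z_D for the quadratic E_W and E_D, using a log-determinant for numerical stability.

```python
import numpy as np
from math import lgamma, log, pi

def log_evidence(M_min, A, alpha, beta, N, H):
    """Log evidence for a solution S_w* under the gaussian approximation.
    M_min = M(w*); A = Hessian of M at w* (k x k); N = degrees of freedom
    in the data; H = number of hidden units (permutation factor g = H! 2^H)."""
    k = A.shape[0]
    sign, logdet_A = np.linalg.slogdet(A)
    assert sign > 0, "A must be positive definite at a minimum"
    log_ZM_star = -M_min + 0.5 * k * log(2 * pi) - 0.5 * logdet_A  # eq. 3.3
    log_ZW = 0.5 * k * log(2 * pi / alpha)   # gaussian prior normalizer
    log_ZD = 0.5 * N * log(2 * pi / beta)    # gaussian noise normalizer
    log_g = lgamma(H + 1) + H * log(2)       # log(H! * 2^H)
    return log_g + log_ZM_star - log_ZW - log_ZD
```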
4 Demonstration
This demonstration examines the evidence for various neural net solutions to a small interpolation problem: the mapping for a two-joint robot arm,
(θ1, θ2) → (y_a, y_b) = (r1 cos θ1 + r2 cos(θ1 + θ2), r1 sin θ1 + r2 sin(θ1 + θ2))

For the training set I used r1 = 2.0 and r2 = 1.3; random samples from a
restricted range of (θ1, θ2) were made, and gaussian noise of magnitude 0.05 was added to the outputs. The neural nets used had one hidden layer of sigmoid units and linear output units. During optimization, the regularizer (equation 1.2) was used initially, and an alternative regularizer was introduced later; β was fixed to its true value (to enable demonstration of the properties of the quantity γ), and α was allowed to adapt to its locally most probable value.

Figure 1 illustrates the performance of a typical neural network trained in this way. Each output is accompanied by error bars evaluated using Denker et al.'s method, including off-diagonal Hessian terms. If β had not been known in advance, it could have been inferred from the data using equation 2.5. For the solution displayed, the model's estimate of β in fact differed negligibly from the true value, so the displayed error bars are the same as if β had been inferred from the data.

Figure 2 shows the data misfit versus the number of hidden units. Notice that, as expected, the data error tends to decrease monotonically with increasing number of parameters. Figure 3 shows the error of these same solutions on an unseen test set, which does not show the same trend as the data error. The data misfit cannot serve as a criterion for choosing between solutions.

Figure 4 shows the evidence for about 100 different solutions using different numbers of hidden units. Notice how the evidence maximum has the characteristic shape of an "Occam hill": steep on the side with too few parameters, and shallow on the side with too many parameters. The quadratic approximations break down when the number of parameters becomes too big compared with the number of data points.

Figure 5 introduces the quantity γ, discussed in MacKay (1992a), the number of well-measured parameters. In cases where the evaluation of the evidence proves difficult, it may be that γ will serve as a useful tool. For example, sampling theory predicts that the addition of redundant parameters to a model should reduce χ²_D by one unit per well-measured parameter; a stopping criterion could detect the point at which, as parameters are deleted, χ²_D starts to increase faster than with gradient 1 as γ decreases (Figure 6).⁴ This use of γ requires prior knowledge of the noise level β; that is why β was fixed to its known value for these demonstrations.

⁴This suggestion is closely related to Moody's (1991) "generalized prediction error," GPE = (χ²_D + 2γ)/N.
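For readers who want to reproduce the setup, here is a sketch of the training-data generation (my own code, not the paper's; r1, r2, the noise level, and the number of pairs are as stated in the text, but the text says only that the inputs came from a "restricted range", so the ranges below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
r1, r2, noise_sd, n = 2.0, 1.3, 0.05, 200

theta1 = rng.uniform(0.5, 2.5, size=n)   # assumed restricted input range
theta2 = rng.uniform(0.5, 2.5, size=n)
ya = r1 * np.cos(theta1) + r2 * np.cos(theta1 + theta2)
yb = r1 * np.sin(theta1) + r2 * np.sin(theta1 + theta2)
targets = np.stack([ya, yb], axis=1) + rng.normal(0.0, noise_sd, size=(n, 2))
inputs = np.stack([theta1, theta2], axis=1)   # 200 i/o pairs, so N = 400
```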
Figure 1: Typical neural network output (inset: training set). This is the output space (y_a, y_b) of the network. The target outputs are displayed as small x's, and the output of the network, with 1σ error bars, is shown as a dot surrounded by an ellipse. The network was trained on samples in two regions in the lower and upper half planes (inset). The outputs illustrated here are for inputs extending a short distance outside the training regions, and bridging the gap between them. Notice that the error bars get much larger around the perimeter. They also increase slightly in the gap between the training regions. These pleasing properties would not have been obtained had the diagonal Hessian approximation of Denker and Le Cun (1991) been used. The above solution was created by a three-layer network with 19 hidden units.
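A minimal sketch of how error bars of this kind can be computed from the full Hessian (my rendering of a Denker and Le Cun-style calculation, not the authors' code; whether to add the intrinsic noise variance 1/β to the displayed bars is a choice I have made explicit):

```python
import numpy as np

def output_error_bar(g, A, beta, include_noise=True):
    """1-sigma error bar for one network output at one input.
    g: gradient of the output with respect to the weights at w_MP (length k);
    A: full Hessian of M at w_MP (k x k), off-diagonal terms included."""
    var = g @ np.linalg.solve(A, g)   # g^T A^{-1} g, uncertainty of the fit
    if include_noise:
        var += 1.0 / beta             # add the presumed noise variance
    return np.sqrt(var)
```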
Now the question is how good a predictor of network quality the evidence is. The fact that the evidence has a maximum at a reasonable number of hidden units is promising. A comparison with Figure 3 shows that the performance of the solutions on an unseen test set has similar overall structure to the evidence. However, Figure 7 shows the evidence against the performance on a test set, and it can be seen that a significant number of solutions with poor evidence actually perform well on the test set. Something is wrong! It is time for a discussion of the relationship between the evidence and generalization ability. We will return later to the failure in Figure 7 and see that it is rectified by the development of new, more probable regularizers.
Figure 2: Data error versus number of hidden units. Each point represents one converged neural network, trained on a 200 i/o pair training set. Each neural net was initialized with different random weights and with a different initial value of σ_W² = 1/α. The two point styles correspond to small and large initial values of σ_W. The error is shown in dimensionless χ² units, such that the expectation of error relative to the truth is 400 ± 20. The solid line is 400 − k, where k is the number of free parameters.
4.1 Relation to "Generalization Error". What is the relationship between the evidence and the generalization error (or its close relative, cross-validation)? A correlation between the two is certainly expected. But the evidence is not necessarily a good predictor of generalization error (see the discussion in MacKay 1992a). First, as illustrated in Figure 8, the error on a test set is a noisy quantity, and many data have to be devoted to the test set to get an acceptable signal-to-noise ratio. Furthermore, imagine that two models have generated solutions to an interpolation problem, and that their two most probable interpolants are completely identical. In this case, the generalization error for the two solutions must be the same, but the evidence will not in general be the same: typically, the model that was a priori more complex will suffer a larger Occam factor and will have smaller evidence. Also, the evidence is a measure of plausibility of the whole ensemble of networks about the optimum, not
Figure 3: Test error versus number of hidden units. The training set and test set both had 200 data points. The test error for solutions found using the first regularizer is shown in dimensionless χ² units, such that the expectation of error relative to the truth is 400 ± 20.

just the optimal network. Thus, there is more to the evidence than there is to the generalization error.

4.2 What If the Bayesian Method Fails? I do not want to dismiss the utility of the generalization error: it can be important for detecting failures of the model being used. For example, if we obtain a poor correlation between the evidence and the generalization error, such that Bayes fails to assign a strong preference to solutions that actually perform well on test data, then we are able to detect and attempt to correct such failures. A failure indicates one of two things, and in either case we are able to learn and improve: either numerical inaccuracies in the evaluation of the probabilities caused the failure, or else the alternative models that were offered to Bayes were a poor selection, ill-matched to the real world (for example, using inappropriate regularizers). When such a failure is detected, it prompts us to examine our models and try to discover the implicit assumptions in the model that the data did not agree with;
Figure 4: Log evidence for solutions using the first regularizer. For each solution, the evidence was evaluated. Notice that an evidence maximum is achieved by neural network solutions using 10, 11, and 12 hidden units. For more than about 19 hidden units, the quadratic approximations used to evaluate the evidence are believed to break down. The number of data points N is 400 (i.e., 200 i/o pairs); cf. the number of parameters in a net with 20 hidden units, k = 102.
Figure 5: The number of well-determined parameters. This figure displays γ as a function of k, for the same network solutions as in Figure 4.
of the weights to the hidden layer and to the output layer. Thus the scales of the two layers of weights are unrelated, and it is inconsistent to force the characteristic decay rates of these different classes of weights to be the same.

This inconsistency is the major cause of the failure illustrated in Figure 7. All the networks deviating substantially from the desired trend have weights to the output layer far larger than the weights to the input layer; this poor match to the model implicit in the regularizer causes the evidence for those solutions to be small. This failure enables us to progress with insight to new regularizers. The alternative that I now present is a prior that is not inconsistent in the way explained above, so there are theoretical reasons to expect it to be "better." However, we will allow the data to choose, by evaluating the evidence for solutions using the new prior; we will find that the new prior is indeed more probable.

The second prior has three independent regularizing constants, corresponding to the characteristic magnitudes of the weights in three different classes c, namely hidden unit weights, hidden unit biases, and output weights and biases (see Figure 9). The term αE_W is replaced by Σ_c α_c E_W^c, where E_W^c = Σ_{i∈c} w_i²/2. Nowlan (1991) has used a similar prior, modeling
Figure 6: Data misfit versus γ. This figure shows χ²_D against γ, and a line of gradient −1. Toward the right, the data's misfit χ²_D is reduced by 1 for every well-measured parameter. When the model has too few parameters, however (toward the left), the misfit gets worse at a greater rate.

weights as coming from a gaussian mixture, and used Bayesian reestimation techniques to update the mixture parameters; he found such a model was good at discovering elegant solutions to problems with translation invariances.

Using the second prior, each regularizing constant is independently adapted to its most probable value by evaluating the number of well-measured parameters γ_c associated with each regularizing function, and finding the optimum where 2α_c E_W^c = γ_c (see the sketch following Figure 7 below). The increased complexity of this prior model is penalized by an Occam factor for each new parameter α_c (see MacKay 1992a). Let me preempt questions along the lines of "why didn't you use four weight classes, or nonzero means?": any other way of assigning weight decays is just another model, and you can try as many as you like; by evaluating the evidence you can then find out what preference the data have for the alternative decay schemes.

New solutions have been found using this second prior, and the evidence evaluated. The evidence for these new solutions with the new prior is shown in Figure 10. Notice that the evidence has increased com-
Figure 7: Log evidence versus test error for the first regularizer. The desired correlation between the evidence and the test error has negative slope. A significant number of points on the lower left violate this desired trend, so we have a failure of Bayesian prediction. The points that violate the trend are networks in which there is a significant difference in typical weight magnitude between the two layers. They are all networks whose learning was initialized with a large value of σ_W. The first regularizer is ill-matched to such networks, and the low evidence is a reflection of this poor prior hypothesis.
pared to the evidence for the first prior. For some solutions the new prior is more probable by a factor of 10³⁰. Now the crunch: does this more probable model make good predictions? The evidence for the second prior is shown against the test error in Figure 11. The correlation between the two is greatly improved. Notice furthermore that not only is the second prior more probable, but the best test error achieved by solutions found using the second prior is slightly better than any achieved using the first prior, and the number of good solutions has increased substantially. Thus the Bayesian evidence is a good predictor of generalization ability, and the Bayesian choice of regularizers has enabled the best solutions to be found.
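Here is the per-class re-estimation sketch promised above (my own rendering, not the paper's code; it uses one standard form of the class-restricted well-measured-parameter count, γ_c = Σ_{i∈c}(1 − α_c [A⁻¹]_{ii}), and the class labels and function names are mine):

```python
import numpy as np

def update_decay_constants(w_mp, A, classes, alphas):
    """One re-estimation step, alpha_c := gamma_c / (2 * E_W^c).
    w_mp: weights at the minimum (length k); A: full Hessian of M at w_mp;
    classes: length-k array assigning each weight to a class label;
    alphas: dict mapping class label -> current alpha_c."""
    A_inv_diag = np.diag(np.linalg.inv(A))
    new_alphas = {}
    for c, alpha_c in alphas.items():
        idx = np.flatnonzero(classes == c)
        E_W_c = 0.5 * np.sum(w_mp[idx] ** 2)
        gamma_c = np.sum(1.0 - alpha_c * A_inv_diag[idx])  # well-measured count
        new_alphas[c] = gamma_c / (2.0 * E_W_c)
    return new_alphas
```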
Figure 8: Comparison of two test errors. This figure illustrates how noisy a performance measure the test error is. Each point compares the error of a trained network on two different test sets. Both test sets consist of 200 data points from the same distribution as the training set.
5 Discussion
The Bayesian method that has been presented is well-founded theoretically, and it works practically, though it remains to be seen how this approach will scale to larger problems. For a particular data set, the evaluation of the evidence has led us objectively from an inconsistent regularizer to a more probable one. The evidence is maximized for a sensible number of hidden units, showing that Occam's razor has been successfully embodied with no ad hoc terms. Furthermore, the solutions with greatest evidence perform better on a test set than any other solutions found. I believe there is currently no other technique that could reliably find and identify better solutions using only the training set. Essential to this success was the simultaneous Bayesian optimization of the three regularizing constants (decay terms) α_c. Optimization of these parameters by any orthodox search technique such as cross-validation would be laborious; if there were many more than three regularizing constants,
Figure 9: The three classes of weights under the second prior. (1) Hidden unit weights. (2) Hidden unit biases. (3) Output unit weights and biases. The weights in one class c share the same decay constant α_c.

as could easily be the case in larger problems, it is hard to imagine any such search being possible.⁵ This brings up the question of how these Bayesian calculations scale with problem size. In terms of the number of parameters k, calculation of the determinant and inverse of the Hessian scales as k³. Note that this is a computation that needs to be carried out only a small number of times compared with the immense number of derivative calculations involved in a typical learning session. However, for large problems it may be too demanding to evaluate the determinant of the Hessian. If this is the case, numerical methods are available to approximate the determinant or trace of a matrix in k² time (Skilling 1989).

5.1 Application to Classification Problems. This paper has thus far discussed the evaluation of the evidence for backprop networks trained on interpolation problems. Neural networks can also be trained to

⁵Radford Neal (personal communication) has pointed out that it is possible to evaluate the gradient of a validation error with respect to parameters such as {α_c}, using ∂E_val/∂α_c = ∂E_val/∂w_MP · ∂w_MP/∂α_c. The first quantity could be evaluated by backprop, and the second term could be found within the quadratic approximation, which gives ∂w_MP/∂α_c = −A⁻¹ I_c w_MP, where I_c is the identity matrix for the weights regularized by α_c and zero elsewhere. Alternatively, Radford Neal has suggested that the gradients ∂E_val/∂α_c could be more efficiently calculated using "recurrent backpropagation" (Pineda 1989), viewing w as the vector of activities of a recurrent network, and w_MP as the fixed point whose error E_val we wish to minimize.
Figure 10: Log evidence versus number of hidden units for the second prior. The different point styles correspond to networks whose learning was initialized with small and large values of σ_W; networks previously trained using the first regularizer and subsequently trained on the second regularizer; and networks in which a weight symmetry was detected (in such cases the evidence evaluation is possibly less reliable).

perform classification tasks. A future publication (MacKay 1992b) will demonstrate that the Bayesian framework for model comparison can be applied to these problems too.

5.2 Relation to V-C Dimension. Some papers advocate the use of V-C dimension (Abu-Mostafa 1990a) as a criterion for penalizing overcomplex models (Abu-Mostafa 1990b; Lee and Tenorio 1991). V-C dimension is most often applied to classification problems; the evidence, on the other hand, can be evaluated equally easily for interpolation and classification problems. V-C dimension is a worst case measure, so it yields different results from Bayesian analysis (Haussler et al. 1991). For example, V-C dimension is indifferent to the use of regularizers like equation 1.2, and to the value of α, because the use of such regularizers does not rule out absolutely any particular network parameters. Thus V-C dimension assigns the same complexity to a model whether or not it is
Figure 11: Log evidence for the second prior versus test error. The correlation between the evidence and the test error for the second prior is very good. Note that the largest value of evidence has increased relative to Figure 7, and the smallest test error has also decreased.

regularized.⁶ So it cannot be used to set regularizing constants α or to compare alternative regularizers. In contrast, the preceding demonstrations show that careful objective choice of regularizer and α is essential for the best solutions to be obtained. Worst case analysis has a complementary role alongside Bayesian methods; neither can substitute for the other.

5.3 Future Tasks. Further work is needed to formalize the relationship of this framework to the pragmatic model comparison technique of cross-validation. Moody's (1991) work on "generalized prediction error" (GPE) is an interesting contribution in this direction. His sampling theory approach predicts that the generalization error, in χ² units, will be (χ²_D + 2γ)/N. However, I have evaluated the GPE for the interpolation models in this paper's demonstration, and found that the correlation

⁶However, E. Levin (personal communication) has mentioned that a measure of "effective V-C dimension" of a regularized model is being developed. In some cases this measure is identical to γ, equation 2.6.
between GPE and the actual test error was poor. More work is needed to understand this.

The gaussian approximation used to evaluate the evidence breaks down when the number of data points is small compared to the number of parameters. For the model problems I have studied so far, the gaussian approximation seemed to break down significantly for N/k < 3 ± 1. It is a matter for further research to characterize this failure and to investigate techniques for improving the evaluation of the integral Z_M*, for example the use of random walks on M in the neighborhood of a solution.

It is expected that evaluation of the evidence should provide an objective rule for deciding whether a network pruning or growing procedure should be stopped, but a careful study of this idea has yet to be performed. It will be interesting to see the results of evaluating the evidence for networks applied to larger real-world problems.

6 Appendix: Numerical Methods
6.1 Quick and Dirty Version. The three numerical tasks are automatic optimization of α and β, calculation of error bars, and evaluation of the evidence. I will describe a cheap approximation for solving the first of these tasks without evaluating the Hessian. If we neglect the distinction between well-determined and poorly determined parameters, we obtain the following update rules for α and β:

α := k/(2E_W)
β := N/(2E_D)
If you want an easy-to-program taste of what a Bayesian framework can offer, try using this procedure to update your decay terms.

6.2 Hessian Evaluation. The Hessian of M, A, is needed to evaluate γ (which relates to Trace A⁻¹), to evaluate the evidence (which relates to det A), and to assign error bars to network outputs (using A⁻¹). I used two methods for evaluating A: (1) an approximate analytic method and (2) second differences. The approximate analytic method was, following Denker et al., to use backprop to obtain the second derivatives, neglecting terms in f″, where f is the activation function of a neuron. The Hessian of the data error is built up as a sum of outer products of gradient vectors:

∇∇E_D ≈ Σ_m Σ_i g_i^m (g_i^m)ᵀ

where g_i^m = ∂y_i(x^m)/∂w. Unlike Denker et al., I did not ignore the off-diagonal terms; the diagonal approximation is not good enough!
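A compact sketch of both numerical shortcuts (my own code, with names I invented; in the paper the gradients g_i^m come from backprop passes, which are not implemented here):

```python
import numpy as np

def approximate_hessian(G, alphas_diag, beta):
    """Outer-product approximation to A = grad-grad M, neglecting f'' terms.
    G: array of shape (n_data * n_outputs, k), rows are the gradients g_i^m;
    alphas_diag: length-k vector giving the decay constant for each weight."""
    return beta * (G.T @ G) + np.diag(alphas_diag)

def quick_and_dirty_update(E_W, E_D, k, N):
    """Section 6.1 update rules, ignoring the well-determined/poorly
    determined distinction (gamma -> k and N - gamma -> N)."""
    return k / (2.0 * E_W), N / (2.0 * E_D)   # new alpha, new beta
```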
For the evaluation of γ, the two methods gave similar results, and either approach seemed satisfactory. However, for the evaluation of the evidence, the approximate analytic method failed to give satisfactory results. The "Occam factors" are very weak, scaling only as log N, and the above approximation apparently introduces systematic errors greater than these. The reason that the evidence evaluation is more sensitive to errors than the γ evaluation is that γ is related to the sum of the eigenvalues, whereas the evidence is related to their product; errors in small eigenvalues jeopardize the product more than the sum. I expect an exact analytic evaluation of the second derivatives (Bishop 1992) would resolve this. To save programming effort I instead used second differences, which is computationally more demanding (~kN backprops) than the analytic approach (~N backprops). There were still problems with errors in small eigenvalues, but it was possible to correct these errors by detecting eigenvalues that were smaller than theoretically permitted.

6.3 Demonstrations. The demonstrations were performed as follows. Initial weight configuration: random weights drawn from a gaussian with σ_W = 0.3. Optimization algorithm for M(w): variable metric methods, using code from Press et al. (1988), used several times in sequence with successively smaller values of the fractional tolerance. Every other loop, the regularizing constants α_c were allowed to adapt in accordance with the reestimation formula

α_c := γ_c/(2E_W^c)   (6.2)
6.4 Precaution. When evaluating the evidence, care must be taken to verify that the permutation term g is appropriately set. It may be the case (probably mainly in toy problems) that the regularizer makes two or more hidden units in a network adopt identical connection values; alternatively, some hidden units might switch off, with all weights set to zero. In these cases the permutation term should be smaller. Also in these cases, it is likely that the quadratic approximation will perform badly (quartic rather than quadratic minima are likely), so it is preferable to automate the deletion of such redundant units.
Acknowledgments

I thank Mike Lewicki, Nick Weir, and Haim Sompolinsky for helpful conversations, and Andreas Herz for comments on the manuscript. This work was supported by a Caltech Fellowship and a Studentship from SERC, UK.
References

Abu-Mostafa, Y. S. 1990a. The Vapnik-Chervonenkis dimension: Information versus complexity in learning. Neural Comp. 1(3), 312-317.
Abu-Mostafa, Y. S. 1990b. Learning from hints in neural networks. J. Complex. 6, 192-198.
Bishop, C. M. 1992. Exact calculation of the Hessian matrix for the multilayer perceptron. Neural Comp., in press.
Denker, J. S., and Le Cun, Y. 1991. Transforming neural-net output levels to probability distributions. In Advances in Neural Information Processing Systems 3, R. P. Lippmann et al., eds., pp. 853-859. Morgan Kaufmann, San Mateo, CA.
Gull, S. F. 1989. Developments in maximum entropy data analysis. In Maximum Entropy and Bayesian Methods, Cambridge, 1988, J. Skilling, ed., pp. 53-71. Kluwer, Dordrecht.
Hanson, R., Stutz, J., and Cheeseman, P. 1991. Bayesian classification theory. NASA Ames TR FIA-90-12-7-01.
Haussler, D., Kearns, M., and Schapire, R. 1991. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. In Proceedings of the Fourth COLT Workshop. Morgan Kaufmann, San Mateo, CA.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing, D. E. Rumelhart et al., eds., pp. 282-317. MIT Press, Cambridge, MA.
Ji, C., Snapp, R. R., and Psaltis, D. 1990. Generalizing smoothness constraints from discrete samples. Neural Comp. 2(2), 188-197.
Le Cun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 598-605. Morgan Kaufmann, San Mateo, CA.
Lee, W. T., and Tenorio, M. F. 1991. On optimal adaptive classifier design criterion: How many hidden units are necessary for an optimal neural network classifier? Purdue University TR-EE-91-5.
Levin, E., Tishby, N., and Solla, S. 1989. A statistical approach to learning and generalization in layered neural networks. In COLT '89: 2nd Workshop on Computational Learning Theory, pp. 245-260. Morgan Kaufmann, San Mateo, CA.
MacKay, D. J. C. 1992a. Bayesian interpolation. Neural Comp. 4(3), 415-447.
MacKay, D. J. C. 1992b. The evidence framework applied to classification networks. Neural Comp., to appear.
Moody, J. E. 1991. Note on generalization, regularization and architecture selection in nonlinear learning systems. In First IEEE-SP Workshop on Neural Networks for Signal Processing. IEEE Computer Society Press, New York.
Nowlan, S. J. 1991. Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures. Carnegie Mellon University doctoral thesis CS91-126.
Pineda, F. J. 1989. Recurrent back-propagation and the dynamical approach to adaptive neural computation. Neural Comp. 1, 161-172.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1988. Numerical Recipes in C. Cambridge University Press, Cambridge.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Rumelhart, D. E. 1987. Cited in Ji et al. 1990.
Skilling, J. 1989. The eigenvalues of mega-dimensional matrices. In Maximum Entropy and Bayesian Methods, Cambridge, 1988, J. Skilling, ed., pp. 455-466. Kluwer, Dordrecht.
Tishby, N., Levin, E., and Solla, S. A. 1989. Consistent inference of probabilities in layered networks: Predictions and generalization. In Proc. IJCNN, Washington.
Walker, A. M. 1967. On the asymptotic behaviour of posterior distributions. J. R. Stat. Soc. B 31, 80-88.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization by weight-elimination with applications to forecasting. In Advances in Neural Information Processing Systems 3, R. P. Lippmann et al., eds., pp. 875-882. Morgan Kaufmann, San Mateo, CA.
Received 21 May 1991; accepted 29 October 1991.
98. S. Sigurdsson, P.A. Philipsen, L.K. Hansen, J. Larsen, M. Gniadecka, H.C. Wulf. 2004. Detection of Skin Cancer by Classification of Raman Spectra. IEEE Transactions on Biomedical Engineering 51:10, 1784-1793. [CrossRef] 99. Vasily Belokurov, N. Wyn Evans, Yann Le Du. 2004. Light-curve classification in massive variability surveys - II. Transients towards the Large Magellanic Cloud. Monthly Notices of the Royal Astronomical Society 352:1, 233-242. [CrossRef] 100. L. Xu. 2004. Advances on BYY Harmony Learning: Information Theoretic Perspective, Generalized Projection Geometry, and Independent Factor Autodetermination. IEEE Transactions on Neural Networks 15:4, 885-902. [CrossRef] 101. M Oud, E J W Maarsingh. 2004. Spirometry and forced oscillometry assisted optimal frequency band determination for the computerized analysis of tracheal lung sounds in asthma. Physiological Measurement 25:3, 595-606. [CrossRef] 102. A. Ilin, H. Valpola, E. Oja. 2004. Nonlinear Dynamical Factor Analysis for State Change Detection. IEEE Transactions on Neural Networks 15:3, 559-575. [CrossRef] 103. Luis Aguirre, Rafael Lopes, Gleison Amaral, Christophe Letellier. 2004. Constraining the topology of neural networks to ensure dynamics with symmetry properties. Physical Review E 69:2. . [CrossRef] 104. Céline Cornet. 2004. Neural network retrieval of cloud parameters of inhomogeneous clouds from multispectral and multiscale radiance data: Feasibility study. Journal of Geophysical Research 109:D12. . [CrossRef] 105. F. Aires. 2004. Neural network uncertainty assessment using Bayesian statistics with application to remote sensing: 3. Network Jacobians. Journal of Geophysical Research 109:D10. . [CrossRef] 106. F. Padberg, T. Ragg, R. Schoknecht. 2004. Using machine learning for estimating the defect content after an inspection. IEEE Transactions on Software Engineering 30:1, 17-28. [CrossRef] 107. F. Aires. 2004. Neural network uncertainty assessment using Bayesian statistics with application to remote sensing: 2. Output errors. Journal of Geophysical Research 109:D10. . [CrossRef] 108. W. Chu, S.S. Keerthi, C.J. Ong. 2004. Bayesian Support Vector Regression Using a Unified Loss Function. IEEE Transactions on Neural Networks 15:1, 29-44. [CrossRef] 109. F. Aires. 2004. Neural network uncertainty assessment using Bayesian statistics with application to remote sensing: 1. Network weights. Journal of Geophysical Research 109:D10. . [CrossRef] 110. Wei Chu , S. Sathiya Keerthi , Chong Jin Ong . 2003. Bayesian Trigonometric Support Vector ClassifierBayesian Trigonometric Support Vector Classifier. Neural Computation 15:9, 2227-2254. [Abstract] [PDF] [PDF Plus]
111. Faming Liang . 2003. An Effective Bayesian Neural Network Classifier with a Comparison Study to Support Vector MachineAn Effective Bayesian Neural Network Classifier with a Comparison Study to Support Vector Machine. Neural Computation 15:8, 1959-1989. [Abstract] [PDF] [PDF Plus] 112. Chee M. Ng. 2003. Comparison of Neural Network, Bayesian, and Multiple Stepwise Regression–Based Limited Sampling Models to Estimate Area Under the Curve. Pharmacotherapy 23:8, 1044-1051. [CrossRef] 113. I. Rivals, L. Personnaz. 2003. Neural-network construction and selection in nonlinear modeling. IEEE Transactions on Neural Networks 14:4, 804-819. [CrossRef] 114. E Haber, L Tenorio. 2003. Learning regularization functionals a supervised training approach. Inverse Problems 19:3, 611-626. [CrossRef] 115. A.A. Ding, Xiali He. 2003. Backpropagation of pseudoerrors: neural networks that are adaptive to heterogeneous noise. IEEE Transactions on Neural Networks 14:2, 253-262. [CrossRef] 116. Ping Guo, M.R. Lyu, C.L.P. Chen. 2003. Regularization parameter estimation for feedforward neural networks. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 33:1, 35-44. [CrossRef] 117. D. Chakraborty, N.R. Pal. 2003. A novel training scheme for multilayered perceptrons to realize proper generalization and incremental learning. IEEE Transactions on Neural Networks 14:1, 1-14. [CrossRef] 118. Olivier Jourdan. 2003. Statistical analysis of cloud light scattering and microphysical properties obtained from airborne measurements. Journal of Geophysical Research 108:D5. . [CrossRef] 119. Harri Valpola , Juha Karhunen . 2002. An Unsupervised Ensemble Learning Method for Nonlinear Dynamic State-Space ModelsAn Unsupervised Ensemble Learning Method for Nonlinear Dynamic State-Space Models. Neural Computation 14:11, 2647-2692. [Abstract] [PDF] [PDF Plus] 120. M. Oud. 2002. Internal-state analysis in a layered artificial neural network trained to categorize lung sounds. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 32:6, 757-760. [CrossRef] 121. Aki Vehtari , Jouko Lampinen . 2002. Bayesian Model Assessment and Comparison Using Cross-Validation Predictive DensitiesBayesian Model Assessment and Comparison Using Cross-Validation Predictive Densities. Neural Computation 14:10, 2439-2468. [Abstract] [PDF] [PDF Plus] 122. Mauricio Carbera-Rios, Konstantin S. Zuyev, Xu Chen, Jose M. Castro, Elliott J. Straus. 2002. Optimizing injection gate location and cycle time for the in-mold coating (IMC) process. Polymer Composites 23:5, 723-738. [CrossRef] 123. Kazuyuki Tanaka. 2002. Statistical-mechanical approach to image processing. Journal of Physics A: Mathematical and General 35:37, R81-R150. [CrossRef]
124. Martin T. Hagan, Howard B. Demuth, Orlando De Jes�s. 2002. An introduction to the use of neural networks in control systems. International Journal of Robust and Nonlinear Control 12:11, 959-985. [CrossRef] 125. Tommi Kärkkäinen . 2002. MLP in Layer-Wise Form with Applications to Weight DecayMLP in Layer-Wise Form with Applications to Weight Decay. Neural Computation 14:6, 1451-1480. [Abstract] [PDF] [PDF Plus] 126. Gaétan Monari , Gérard Dreyfus . 2002. Local Overfitting Control via LeveragesLocal Overfitting Control via Leverages. Neural Computation 14:6, 1481-1506. [Abstract] [PDF] [PDF Plus] 127. Peter Sollich , Anason Halees . 2002. Learning Curves for Gaussian Process Regression: Approximations and BoundsLearning Curves for Gaussian Process Regression: Approximations and Bounds. Neural Computation 14:6, 1393-1428. [Abstract] [PDF] [PDF Plus] 128. P.M. Wong, A.G. Bruce, T.D. Gedeon. 2002. Confidence bounds of petrophysical predictions from conventional neural networks. IEEE Transactions on Geoscience and Remote Sensing 40:6, 1440-1444. [CrossRef] 129. P.J. Edwards, A.M. Peacock, D. Renshaw, J.M. Hannah, A.F. Murray. 2002. Minimizing risk using prediction uncertainty in neural network estimation fusion and its application to papermaking. IEEE Transactions on Neural Networks 13:3, 726-731. [CrossRef] 130. Christophe Andrieu , Nando de Freitas , Arnaud Doucet . 2001. Robust Full Bayesian Learning for Radial Basis NetworksRobust Full Bayesian Learning for Radial Basis Networks. Neural Computation 13:10, 2359-2407. [Abstract] [PDF] [PDF Plus] 131. W. Wang , P. Jones , D. Partridge . 2001. A Comparative Study of Feature-Salience Ranking TechniquesA Comparative Study of Feature-Salience Ranking Techniques. Neural Computation 13:7, 1603-1623. [Abstract] [PDF] [PDF Plus] 132. Masa-aki Sato . 2001. Online Model Selection Based on the Variational BayesOnline Model Selection Based on the Variational Bayes. Neural Computation 13:7, 1649-1681. [Abstract] [PDF] [PDF Plus] 133. Bai-Ling Zhang, R. Coggins, M.A. Jabri, D. Dersch, B. Flower. 2001. Multiresolution forecasting for futures trading using wavelet decompositions. IEEE Transactions on Neural Networks 12:4, 765-775. [CrossRef] 134. Chi-Sing Leung, Ah-Chung Tsoi, Lai Wan Chan. 2001. Two regularizers for recursive least squared algorithms in feedforward multilayered neural networks. IEEE Transactions on Neural Networks 12:6, 1314-1332. [CrossRef] 135. G. Papadopoulos, P.J. Edwards, A.F. Murray. 2001. Confidence estimation methods for neural networks: a practical comparison. IEEE Transactions on Neural Networks 12:6, 1278-1287. [CrossRef]
136. A.P. Engelbrecht. 2001. A new pruning heuristic based on variance analysis of sensitivity information. IEEE Transactions on Neural Networks 12:6, 1386-1399. [CrossRef] 137. John H. Xin, Sijie Shao, Korris Fu-lai Chung. 2000. Colour-appearance modeling using feedforward networks with Bayesian regularization method? Part I: Forward model. Color Research & Application 25:6, 424-434. [CrossRef] 138. D. Bazell. 2000. Feature relevance in morphological galaxy classification. Monthly Notices of the Royal Astronomical Society 316:3, 519-528. [CrossRef] 139. J. Mateos, A.K. Katsaggelos, R. Molina. 2000. A Bayesian approach for the estimation and transmission of regularization parameters for reducing blocking artifacts. IEEE Transactions on Image Processing 9:7, 1200-1215. [CrossRef] 140. M S Hazell, R Jones, P E Luffman, R F Sewell. 2000. Measurement Science and Technology 11:3, 227-236. [CrossRef] 141. J.-P. Vila, V. Wagner, P. Neveu. 2000. Bayesian nonlinear model selection and neural networks: a conjugate prior approach. IEEE Transactions on Neural Networks 11:2, 265-278. [CrossRef] 142. Y. Grandvalet. 2000. Anisotropic noise injection for input variables relevance determination. IEEE Transactions on Neural Networks 11:6, 1201-1212. [CrossRef] 143. C. Goutte, F.A. Nielsen, K.H. Hansen. 2000. Modeling the hemodynamic response in fMRI using smooth FIR filters. IEEE Transactions on Medical Imaging 19:12, 1188-1201. [CrossRef] 144. S. Roberts, I. Rezek, R. Everson, H. Stone, S. Wilson, C. Alford. 2000. Automated assessment of vigilance using committees of radial basis function analysers. IEE Proceedings - Science, Measurement and Technology 147:6, 333. [CrossRef] 145. C.C. Holmes, B.K. Mallick. 2000. Bayesian wavelet networks for nonparametric regression. IEEE Transactions on Neural Networks 11:1, 27-35. [CrossRef] 146. D.J.C. Mackay, M.N. Gibbs. 2000. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks 11:6, 1458-1464. [CrossRef] 147. G.P. Zhang. 2000. Neural networks for classification: a survey. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 30:4, 451-462. [CrossRef] 148. A. Veiga, M.C. Medeiros. 2000. A hybrid linear-neural model for time series forecasting. IEEE Transactions on Neural Networks 11:6, 1402-1412. [CrossRef] 149. Rudolf Kulhavý, Petya Ivanova. 1999. Quo vadis, Bayesian identification?. International Journal of Adaptive Control and Signal Processing 13:6, 469-485. [CrossRef] 150. David J. C. MacKay . 1999. Comparison of Approximate Methods for Handling HyperparametersComparison of Approximate Methods for Handling
Hyperparameters. Neural Computation 11:5, 1035-1068. [Abstract] [PDF] [PDF Plus] 151. Yukito Iba. 1999. Journal of Physics A: Mathematical and General 32:21, 3875-3888. [CrossRef] 152. John Sum , Chi-sing Leung , Gilbert H. Young , Lai-wan Chan , Wing-kay Kan . 1999. An Adaptive Bayesian Pruning for Neural Networks in a Non-Stationary EnvironmentAn Adaptive Bayesian Pruning for Neural Networks in a Non-Stationary Environment. Neural Computation 11:4, 965-976. [Abstract] [PDF] [PDF Plus] 153. J. Clark, K. Gernoth, S. Dittmar, M. Ristig. 1999. Higher-order probabilistic perceptrons as Bayesian inference engines. Physical Review E 59:5, 6161-6174. [CrossRef] 154. N.W. Townsend, L. Tarassenko. 1999. Estimations of error bounds for neural-network function approximators. IEEE Transactions on Neural Networks 10:2, 217-230. [CrossRef] 155. J.T.-Y. Kwok. 1999. Moderating the outputs of support vector machine classifiers. IEEE Transactions on Neural Networks 10:5, 1018-1031. [CrossRef] 156. R. Molina, A.K. Katsaggelos, J. Mateos. 1999. Bayesian and regularization methods for hyperparameter estimation in image restoration. IEEE Transactions on Image Processing 8:2, 231-246. [CrossRef] 157. W.A. Wright. 1999. Bayesian approach to neural-network modeling with input uncertainty. IEEE Transactions on Neural Networks 10:6, 1261-1270. [CrossRef] 158. P.J. Edwards, A.F. Murray, G. Papadopoulos, A.R. Wallace, J. Barnard, G. Smith. 1999. The application of neural networks to the papermaking industry. IEEE Transactions on Neural Networks 10:6, 1456-1464. [CrossRef] 159. Sheng Ma, Chuanyi Ji. 1999. Performance and efficiency: recent advances in supervised learning. Proceedings of the IEEE 87:9, 1519-1535. [CrossRef] 160. N.K. Treadgold, T.D. Gedeon. 1999. Exploring constructive cascade networks. IEEE Transactions on Neural Networks 10:6, 1335-1350. [CrossRef] 161. Akio Utsugi . 1998. Density Estimation by Mixture Models with Smoothing PriorsDensity Estimation by Mixture Models with Smoothing Priors. Neural Computation 10:8, 2115-2135. [Abstract] [PDF] [PDF Plus] 162. Michael D. Lee . 1998. Neural Feature Abstraction from Judgments of SimilarityNeural Feature Abstraction from Judgments of Similarity. Neural Computation 10:7, 1815-1830. [Abstract] [PDF] [PDF Plus] 163. Gokaraju K. Raju, Charles L. Cooney. 1998. Active learning from process data. AIChE Journal 44:10, 2199-2211. [CrossRef] 164. Christopher K. I. Williams . 1998. Computation with Infinite Neural NetworksComputation with Infinite Neural Networks. Neural Computation 10:5, 1203-1216. [Abstract] [PDF] [PDF Plus]
165. C. C. Holmes , B. K. Mallick . 1998. Bayesian Radial Basis Functions of Variable DimensionBayesian Radial Basis Functions of Variable Dimension. Neural Computation 10:5, 1217-1233. [Abstract] [PDF] [PDF Plus] 166. Siegfried Bös. 1998. Statistical mechanics approach to early stopping and weight decay. Physical Review E 58:1, 833-844. [CrossRef] 167. Siegfried Bös, Manfred Opper. 1998. Journal of Physics A: Mathematical and General 31:21, 4835-4850. [CrossRef] 168. Peter J. Edwards, Alan F. Murray. 1998. Toward Optimally Distributed ComputationToward Optimally Distributed Computation. Neural Computation 10:4, 987-1005. [Abstract] [PDF] [PDF Plus] 169. D.A. Miller, J.M. Zurada. 1998. A dynamical system perspective of structural learning with forgetting. IEEE Transactions on Neural Networks 9:3, 508-515. [CrossRef] 170. Peter Müller, David Rios Insua. 1998. Issues in Bayesian Analysis of Neural Network ModelsIssues in Bayesian Analysis of Neural Network Models. Neural Computation 10:3, 749-770. [Abstract] [PDF] [PDF Plus] 171. S.J. Roberts, D. Husmeier, I. Rezek, W. Penny. 1998. Bayesian approaches to Gaussian mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 20:11, 1133-1142. [CrossRef] 172. B. LeBaron, A.S. Weigend. 1998. A bootstrap evaluation of the effect of data splitting on financial time series. IEEE Transactions on Neural Networks 9:1, 213-220. [CrossRef] 173. Sheng Ma, Chuanyi Ji. 1998. Fast training of recurrent networks based on the EM algorithm. IEEE Transactions on Neural Networks 9:1, 11-26. [CrossRef] 174. Frederico V. Prudente, Paulo H. Acioli, J. J. Soares Neto. 1998. The fitting of potential energy surfaces using neural networks: Application to the study of vibrational levels of H[sub 3][sup +]. The Journal of Chemical Physics 109:20, 8801. [CrossRef] 175. Tin-Yau Kwok, Dit-Yan Yeung. 1997. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks 8:3, 630-645. [CrossRef] 176. Akio Utsugi. 1997. Hyperparameter Selection for Self-Organizing MapsHyperparameter Selection for Self-Organizing Maps. Neural Computation 9:3, 623-635. [Abstract] [PDF] [PDF Plus] 177. Padhraic Smyth , David Heckerman , Michael I. Jordan . 1997. Probabilistic Independence Networks for Hidden Markov Probability ModelsProbabilistic Independence Networks for Hidden Markov Probability Models. Neural Computation 9:2, 227-269. [Abstract] [PDF] [PDF Plus] 178. Vijay Balasubramanian . 1997. Statistical Inference, Occam's Razor, and Statistical Mechanics on the Space of Probability DistributionsStatistical
Inference, Occam's Razor, and Statistical Mechanics on the Space of Probability Distributions. Neural Computation 9:2, 349-368. [Abstract] [PDF] [PDF Plus] 179. Sepp Hochreiter, Jürgen Schmidhuber. 1997. Flat MinimaFlat Minima. Neural Computation 9:1, 1-42. [Abstract] [PDF] [PDF Plus] 180. Anders Krogh, Peter Sollich. 1997. Statistical mechanics of ensemble learning. Physical Review E 55:1, 811-825. [CrossRef] 181. Tin-Yan Kwok, Dit-Yan Yeung. 1997. Objective functions for training new hidden units in constructive neural networks. IEEE Transactions on Neural Networks 8:5, 1131-1148. [CrossRef] 182. Stephen J. Roberts, Will Penny. 1997. Neural networks: friends or foes?. Sensor Review 17:1, 64-70. [CrossRef] 183. David Faraggi, R. Simon, E. Yaskil, A. Kramar. 1997. Bayesian Neural Network Models for Censored Data. Biometrical Journal 39:5, 519-532. [CrossRef] 184. Sung-Bae Cho. 1997. Neural-network classifiers for recognizing totally unconstrained handwritten numerals. IEEE Transactions on Neural Networks 8:1, 43-53. [CrossRef] 185. J. T. Connor. 1996. A robust neural network filter for electricity demand prediction. Journal of Forecasting 15:6, 437-458. [CrossRef] 186. R.D. Morris, A.D.M. Garvin. 1996. Fast probabilistic self-structuring of generalized single-layer networks. IEEE Transactions on Neural Networks 7:4, 881-888. [CrossRef] 187. Manfred Opper, Ole Winther. 1996. Mean Field Approach to Bayes Learning in Feed-Forward Neural Networks. Physical Review Letters 76:11, 1964-1967. [CrossRef] 188. Takio Kurita, Hideki Asoh, Shinji Umeyama, Shotaro Akaho, Akitaka Hosomi. 1996. Influence of noises added to hidden units on learning of multilayer perceptrons and structurization of networks. Systems and Computers in Japan 27:11, 64-73. [CrossRef] 189. H.H. Thodberg. 1996. A review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks 7:1, 56-72. [CrossRef] 190. Huaiyu Zhu, Richard Rohwer. 1995. Bayesian invariant measurements of generalization. Neural Processing Letters 2:6, 28-31. [CrossRef] 191. Gustavo Deco, Bernd Schürmann. 1995. Statistical-ensemble theory of redundancy reduction and the duality between unsupervised and supervised neural learning. Physical Review E 52:6, 6580-6587. [CrossRef] 192. Gustavo Deco, Dragan Obradovic. 1995. Statistical physics theory of query learning by an ensemble of higher-order neural networks. Physical Review E 52:2, 1953-1957. [CrossRef]
193. G Marion, D Saad. 1995. Journal of Physics A: Mathematical and General 28:8, 2159-2171. [CrossRef] 194. J M Pryce, A D Bruce. 1995. Journal of Physics A: Mathematical and General 28:3, 511-532. [CrossRef] 195. G. Deco , W. Finnoff , H. G. Zimmermann . 1995. Unsupervised Mutual Information Criterion for Elimination of Overtraining in Supervised Multilayer NetworksUnsupervised Mutual Information Criterion for Elimination of Overtraining in Supervised Multilayer Networks. Neural Computation 7:1, 86-107. [Abstract] [PDF] [PDF Plus] 196. Peter M. Williams . 1995. Bayesian Regularization and Pruning Using a Laplace PriorBayesian Regularization and Pruning Using a Laplace Prior. Neural Computation 7:1, 117-143. [Abstract] [PDF] [PDF Plus] 197. Chris M. Bishop . 1995. Training with Noise is Equivalent to Tikhonov RegularizationTraining with Noise is Equivalent to Tikhonov Regularization. Neural Computation 7:1, 108-116. [Abstract] [PDF] [PDF Plus] 198. Lars Kai Hansen , Carl Edward Rasmussen . 1994. Pruning from Adaptive RegularizationPruning from Adaptive Regularization. Neural Computation 6:6, 1223-1232. [Abstract] [PDF] [PDF Plus] 199. Nenad Ivezic, James H. Garrett. 1994. A neural network-based machine learning approach for supporting synthesis. Artificial Intelligence for Engineering, Design, Analysis and Manufacturing 8:02, 143. [CrossRef] 200. Barak A. Pearlmutter . 1994. Fast Exact Multiplication by the HessianFast Exact Multiplication by the Hessian. Neural Computation 6:1, 147-160. [Abstract] [PDF] [PDF Plus] 201. Chris M. Bishop. 1994. Neural networks and their applications. Review of Scientific Instruments 65:6, 1803. [CrossRef] 202. Peter M. Williams. 1993. Aeromagnetic compensation using neural networks. Neural Computing & Applications 1:3, 207-214. [CrossRef] 203. Timothy Watkin, Albrecht Rau, Michael Biehl. 1993. The statistical mechanics of learning a rule. Reviews of Modern Physics 65:2, 499-556. [CrossRef] 204. David J. C. MacKay . 1992. The Evidence Framework Applied to Classification NetworksThe Evidence Framework Applied to Classification Networks. Neural Computation 4:5, 720-736. [Abstract] [PDF] [PDF Plus] 205. David J. C. MacKay. 1992. Bayesian InterpolationBayesian Interpolation. Neural Computation 4:3, 415-447. [Abstract] [PDF] [PDF Plus] 206. Terrence L. FineProbability, Foundations of-II . [CrossRef]
ARTICLE
Communicated by Alan Lapedes
Simplifying Neural Networks by Soft Weight-Sharing Steven J. Nowlan Computational Neurobiology Laboratory, The Salk Institute, P.O. Box 5800, San Diego, CA 92186-5800 USA
Geoffrey E. Hinton Department of Computer Science, University of Toronto, Toronto, Canada M5S 1A4
One way of simplifying neural networks so they generalize better is to add an extra term to the error function that will penalize complexity. Simple versions of this approach include penalizing the sum of the squares of the weights or penalizing the number of nonzero weights. We propose a more complicated penalty term in which the distribution of weight values is modeled as a mixture of multiple gaussians. A set of weights is simple if the weights have high probability density under the mixture model. This can be achieved by clustering the weights into subsets with the weights in each cluster having very similar values. Since we do not know the appropriate means or variances of the clusters in advance, we allow the parameters of the mixture model to adapt at the same time as the network learns. Simulations on two different problems demonstrate that this complexity term is more effective than previous complexity terms.

1 Introduction

A major problem in training artificial neural networks is to ensure that they will generalize well to cases that they have not been trained on. Some recent theoretical results (Baum and Haussler 1989) have suggested that in order to guarantee good generalization the amount of information required to specify directly the output vectors of all the training cases must be considerably larger than the number of independent weights in the network. In many practical problems there is only a small amount of labeled data available for training, and this creates problems for any approach that uses a large, homogeneous network in order to avoid the detailed task analysis required to design a network with fewer independent weights and a specific architecture that is appropriate to the task. As a result, there has been much recent interest in techniques that can train large networks with relatively small amounts of labeled data and still provide good generalization performance. Neural Computation 4, 473-493 (1992) © 1992 Massachusetts Institute of Technology
One way to achieve this goal is to reduce the effective number of free parameters in the network. A number of authors (e.g., Rumelhart et al. 1986, chapter 8; Lang et al. 1990; le Cun 1989, 1987) have proposed the idea of weight sharing, in which a single weight is shared among many connections in the network so that the number of adjustable weights in the network is much less than the number of connections. This approach is effective when the problem being addressed is quite well understood, so that it is possible to specify, in advance, which weights should be identical (le Cun et al. 1990). Another approach is to use a network with too many weights, but to stop the training before overlearning on the training set has occurred (Morgan and Bourlard 1989; Weigend et al. 1990). In addition to the usual training and testing sets, a validation set is used. When performance on the validation set starts to decrease, the network is beginning to overfit the training set and training is stopped. Some experience with this technique has suggested that its effectiveness can be quite sensitive to the particular stopping criterion used (Weigend et al. 1990).¹

¹This approach is not the way in which cross-validation is usually used in the statistics community. Usually a cross-validation set is used to determine the scale factor applied to a regularization term added to the optimization to prevent overfitting. We will see examples of this in the following sections.

Yet another approach to the generalization problem attempts to remove excess weights from the network, either during or after training, to improve the generalization performance. Mozer and Smolensky (1989) and le Cun et al. (1990) have both proposed techniques in which a network is initially trained with an excess number of parameters and then a criterion is used to remove redundant parameters. The reduced network is then trained further. The cycle of reduction and retraining may be repeated more than once. The approach of Mozer and Smolensky (1989) estimates the relevance of individual units to network performance and removes redundant units and their weights. The method of le Cun et al. (1990) uses second-order gradient information to estimate the sensitivity of network performance to the removal of each weight, and removes the least critical weights. An older and simpler approach to removing excess weights from a network is to add an extra term to the error function that penalizes complexity:

cost = data-misfit + λ · complexity    (1.1)
During learning, the network is trying to find a locally optimal tradeoff between the data-misfit (the usual error term) and the complexity of the net. The relative importance of these two terms can be estimated by finding the value of λ that optimizes generalization to a validation set. Probably the simplest approximation to complexity is the sum of the squares of the weights, Σ_i w_i². Differentiating this complexity measure leads to simple weight decay (Plaut et al. 1986) in which each weight
decays toward zero at a rate that is proportional to its magnitude. This decay is countered by the gradient of the error term, so weights that are not critical to network performance, and hence always have small error gradients, decay away, leaving only the weights necessary to solve the problem. At the end of learning, the magnitude of a weight is exactly proportional to its error derivative, which makes it particularly easy to interpret the weights (see, for example, Hinton 1986).

Minimizing Σ_i w_i² is a well-known technique when fitting linear regression models that have too many degrees of freedom. One justification is that it minimizes the sensitivity of the output to noise in the input, since in a linear system the variance of the noise in the output is just the variance of the noise in the input multiplied by the squared weights (Kohonen 1977). The use of a Σ_i w_i² penalty term can also be interpreted from a Bayesian perspective.² The "complexity" of a set of weights, λ Σ_i w_i², may be described as its negative log probability density under a radially symmetric gaussian prior distribution on the weights. The distribution is centered at the origin and has variance 1/λ. For multilayer networks, it is hard to find a good theoretical justification for this prior, but Hinton (1987) justifies it empirically by showing that it greatly improves generalization on a very difficult task. More recently, MacKay (1991) has shown that even better generalization can be achieved by using different values of λ for the weights in different layers.

²R. Szeliski, personal communication (1985).
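To make the weight-decay reading of this penalty concrete, here is a minimal numerical sketch (NumPy; the weight vector and the value of λ are illustrative, not values from the paper). Differentiating λ Σ_i w_i² gives a decay force on each weight that is proportional to its magnitude:

    import numpy as np

    # Illustrative weight vector and trade-off parameter.
    w = np.array([0.8, -0.05, 0.3, -1.2])
    lam = 0.01

    complexity = lam * np.sum(w ** 2)   # lambda * sum_i w_i^2
    grad = 2.0 * lam * w                # gradient: proportional to each weight

    # A gradient step on the penalty alone shrinks every weight toward zero
    # at a rate proportional to its magnitude (simple weight decay).
    w_decayed = w - 0.1 * grad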
2 A More Complex Measure of Network Complexity

One potential drawback of Σ_i w_i² as a penalty term is that it can favor two weak interactions over one strong one. For example, if a unit receives input from two units that are highly correlated with each other, its behavior will be similar if the two connections have weights of w and 0 or weights of w/2 and w/2. The penalty term favors the latter because
(w/2)² + (w/2)² < w² + 0²    (2.1)
If we wish to drive small weights toward zero without forcing large weights away from the values they need to model the data, we can use a prior which is a mixture of a narrow (n) and a broad (b) gaussian, both centered at zero:

p(w) = π_n p_n(w) + π_b p_b(w)    (2.2)

where π_n and π_b are the mixing proportions of the two gaussians and are therefore constrained to sum to 1.
Assuming that the weight values were generated from a gaussian mixture, the conditional probability that a particular weight, w_i, was generated by a particular gaussian, j, is called the responsibility³ of that gaussian for the weight and is given by

r_j(w_i) = π_j p_j(w_i) / Σ_k π_k p_k(w_i)    (2.3)

where p_j(w_i) is the probability density of w_i under gaussian j. When the mixing proportions of the two gaussians are comparable, the narrow gaussian gets most of the responsibility for a small weight. Adopting the Bayesian perspective, the cost of a weight under the narrow gaussian is proportional to w_i²/2σ_n². As long as σ_n is quite small there will be strong pressure to reduce the magnitude of small weights even further. Conversely, the broad gaussian takes most of the responsibility for large weight values, so there is much less pressure to reduce them. In the limiting case when the broad gaussian becomes a uniform distribution, there is almost no pressure to reduce very large weights because they are almost certainly generated by the uniform distribution. A complexity term very similar to this limiting case has been used successfully by Weigend et al. (1990) to improve generalization for a time series prediction task.⁴

³This is more commonly referred to as the posterior probability of gaussian j given weight w_i.
⁴See Nowlan (1991) for a precise description of the relationship between mixture models and the model used by Weigend et al. (1990).

There is an alternative justification for using a complexity term that is a mixture of a uniform distribution and a narrow, zero-mean gaussian. The negative log probability is approximately constant for large weights but smoothly approaches a much lower value as the weight approaches zero. So the complexity cost is a smoothed version of the obvious discrete cost function that has a value of zero for weights which are zero and a value of 1 for all other weights. This smoothed cost function is suitable for gradient descent learning, whereas the discrete one is not.
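As a concrete illustration of equation 2.3, the following sketch (NumPy; the component settings are illustrative, not values from the paper) evaluates a narrow/broad mixture on one small and one large weight and shows how the responsibility divides between the two components:

    import numpy as np

    def gauss(w, mu, sigma):
        # Gaussian probability density evaluated at w.
        return np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

    # Narrow (n) and broad (b) zero-mean components; settings are illustrative.
    pi_n, sigma_n = 0.5, 0.05
    pi_b, sigma_b = 0.5, 1.0

    w = np.array([0.02, 0.9])          # one small and one large weight
    p_n = pi_n * gauss(w, 0.0, sigma_n)
    p_b = pi_b * gauss(w, 0.0, sigma_b)

    r_n = p_n / (p_n + p_b)            # responsibility of the narrow gaussian
    # r_n is close to 1 for the small weight and close to 0 for the large one:
    # small weights feel strong pressure toward zero, large weights do not.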
3 Adaptive Gaussian Mixtures and Soft Weight-Sharing

A mixture of a narrow, zero-mean gaussian with a broad gaussian or a uniform distribution allows us to favor networks with many near-zero weights, and this improves generalization on many tasks, particularly those in which there is some natural measure of locality that determines which units need to interact and which do not. But practical experience with hand-coded weight constraints has also shown that great improvements can be achieved by constraining particular subsets of the weights to share the same value (Lang et al. 1990; le Cun 1989). Mixtures of zero-mean gaussians and uniforms cannot implement this type of symmetry constraint. If however, we use multiple gaussians and allow their means
and variances to adapt as the network learns, we can implement a "soft" version of weight-sharing in which the learning algorithm decides for itself which weights should be tied together. (We may also allow the mixing proportions to adapt so that we are not assuming all sets of tied weights are the same size.) If we know, in advance, that two connections should probably have the same weight, we can introduce a complexity penalty proportional to the squared difference between the two weights. But if we do not know which pairs of weights should be the same it is harder to see how to favor solutions in which the weights are divided into subsets and the weights within a subset are nearly identical. We now show that a mixture of gaussians model can achieve just this effect. The basic idea is that a gaussian that takes responsibility for a subset of the weights will squeeze those weights together since it can then have a lower variance and hence assign a higher probability density to each weight. If the gaussians all start with high variance, the initial division of weights into subsets will be very soft. As the variances shrink and the network learns, the decisions about how to group the weights into subsets are influenced by the task the network is learning to perform.

4 An Update Algorithm
To make the intuitive ideas of the previous section a bit more concrete, we may define a cost function of the general form given in (1.1):

C = (K/2σ_y²) Σ_c (y_c − d_c)² − Σ_i log [ Σ_j π_j p_j(w_i) ]    (4.1)
where σ_y² is the variance of the squared error, y_c and d_c are the actual and desired outputs for training case c, and each p_j(w) is a gaussian density with mean μ_j and standard deviation σ_j. We will optimize this function using some form of gradient descent to adjust the w_i as well as the mixture parameters π_j, μ_j, and σ_j, and σ_y.⁵ The partial derivative of C with respect to each weight is the sum of the usual squared error derivative and a term due to the complexity cost for the weight:

∂C/∂w_i = (K/σ_y²) Σ_c (y_c − d_c) ∂y_c/∂w_i + Σ_j r_j(w_i) (w_i − μ_j)/σ_j²    (4.2)

⁵1/σ_y² may be thought of as playing the same role as λ in equation 1.1 in determining a trade-off between the misfit and complexity costs. σ_y is re-estimated as learning proceeds, so this trade-off is not constant. K is a factor that adjusts for the effective number of degrees of freedom (or number of well-determined parameters) in a problem. For the simulations described here its value was close to 1.0 and was determined by cross-validation.

The derivative of the complexity cost term is simply a weighted sum of the difference between the weight value and the center of each of the
gaussians. The weighting factors are the responsibility measures defined in equation 2.3, and if over time a single gaussian claims most of the responsibility for a particular weight, the effect of the complexity cost term is simply to pull the weight toward the center of the responsible gaussian. The strength of this force is inversely proportional to the variance of the gaussian. Notice also that since the derivative of the complexity cost term is actually a sum of forces exerted by each gaussian, the net force exerted on the weight can be very small even when the forces exerted by some of the individual gaussians are quite large (e.g., a weight placed midway between two gaussians of equal variance has zero net force acting on it). This allows one to set up initial conditions in which each gaussian accounts quite well for at least some of the weights in the network, but the overall force on any weight due to the complexity term is negligible, so the weights are initially driven primarily by the derivative of the data-misfit term. The partial derivatives of C with respect to the means and variances of the gaussians in the mixture have similar forms:

∂C/∂μ_j = Σ_i r_j(w_i) (μ_j − w_i)/σ_j²    (4.3)

∂C/∂σ_j = Σ_i r_j(w_i) [σ_j² − (w_i − μ_j)²]/σ_j³    (4.4)
The derivative for μ_j simply drives μ_j toward the weighted average of the set of weights gaussian j is responsible for. Similarly, the derivative for σ_j drives it toward the weighted average of the squared deviations of these weights about μ_j. The derivation of the partial of C with respect to the mixing proportions is slightly less straightforward, since we must worry about maintaining the constraint that the mixing proportions must sum to 1. Appropriate use of a Lagrange multiplier and a bit of algebraic manipulation leads to the simple expression:

∂C/∂π_j = Σ_i [π_j − r_j(w_i)]/π_j    (4.5)

Once again the result is intuitive; π_j is moved toward the average responsibility of gaussian j for all of the weights. The partial derivatives of C with respect to each of the mixture parameters are simple enough that, for fixed values of the responsibilities, the exact minimizer can be found analytically with ease (for example, the minimizer for equation 4.3 is simply μ_j = Σ_i r_j(w_i) w_i / Σ_i r_j(w_i)). This suggests that one could proceed by simply recomputing the r_j(w_i) after each weight update and setting the μ_j, σ_j, and π_j to their exact minimizers given the current r_j(w_i). In fact the process of recomputing the r_j(w_i) and then setting all the parameters to their analytic minimizers corresponds to one iteration of the EM algorithm applied to mixture estimation (Dempster et al. 1977). This is a sensible and quite efficient algorithm to use for
estimating the mixture parameters when we are dealing with a stationary data distribution. However, in the case we are considering it is clear that the "data" we are modeling, the set of w_i, does not have a stationary distribution. In order to avoid stability problems, it is very important that the rate of change of our mixture parameters be tied to the rate of change of the weights themselves. For this reason, we choose to update all of the parameters (w_i, μ_j, σ_j, π_j) simultaneously using a conjugate gradient descent procedure.⁶

Before considering applications of the method outlined above we need to consider briefly the issue of initializing all of our parameters appropriately. It is well known that maximum likelihood methods for fitting mixtures can be very sensitive to poor initial conditions (McLachlan and Basford 1988). For example, if one component of a mixture initially has little responsibility for any of the weights in the network, its mixing proportion is driven rapidly toward zero and it is very difficult to recover from this situation. Fortunately, in the case of a network we usually know the initial weight distribution and so we can initialize the mixture appropriately. Commonly, we initialize the network weights so they are uniformly distributed over an interval [−W, W]. In this case we may initialize the means of the gaussians so they are spaced evenly over the interval [−W, W], and set all of the variances equal to the spacing between adjacent means and the mixing proportions equal to each other. This ensures that each component in the mixture initially has the same total responsibility over the entire set of weights,⁷ and also produces sufficient counterbalance between the forces from each gaussian so most of the weights in the network initially receive very little net force from the complexity measure. This initialization procedure is used for all of the simulations discussed in this paper.

There is one additional trick used in the simulations discussed in this paper. The variances of the mixture components, σ_j², must of course be restricted to be positive, and in addition, if the variance of any component is allowed to approach 0 too closely, the likelihood may become unbounded. To maintain the positivity constraint and at the same time make it difficult for the variance of a component to approach 0, we define the variance of the components in terms of a set of auxiliary variables:

σ_j² = exp(γ_j)    (4.6)
where the value of γ_j is unrestricted. The gradient descent is performed on the set of γ_j rather than directly on the σ_j.⁸

⁶Any method of gradient descent could be used in the parameter update; however, the conjugate gradient technique is quite fast and avoids the need to tune optimization parameters such as step size or momentum rate.
⁷There is a minor edge effect for the two most extreme components.
⁸The use of the γ_j may be thought of simply as a technique for getting a better conditioned optimization problem.
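The pieces of this section can be collected into a short sketch of the penalty-gradient computation. The NumPy code below is one possible reading, not the authors' implementation: it evaluates the responsibilities of equation 2.3, the complexity-cost gradients of equations 4.2-4.5 (the penalty part only; in full training these would be added to the data-misfit gradients from backpropagation and handed to a conjugate gradient routine), the exponential parameterization of equation 4.6, and the initialization scheme just described.

    import numpy as np

    def gauss(w, mu, sigma):
        # Gaussian density of each weight under each mixture component.
        return np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

    def responsibilities(w, pi, mu, sigma):
        # r[j, i]: responsibility of gaussian j for weight i (equation 2.3).
        p = pi[:, None] * gauss(w[None, :], mu[:, None], sigma[:, None])
        return p / p.sum(axis=0, keepdims=True)

    def penalty_gradients(w, pi, mu, gamma):
        # Gradients of the complexity term of equation 4.1 with respect to
        # the weights and mixture parameters (equations 4.2-4.5), assuming
        # sigma_j^2 = exp(gamma_j) as in equation 4.6.
        sigma = np.exp(0.5 * gamma)
        r = responsibilities(w, pi, mu, sigma)
        diff = w[None, :] - mu[:, None]                 # w_i - mu_j
        s2 = sigma[:, None] ** 2
        dC_dw = (r * diff / s2).sum(axis=0)             # penalty part of eq. 4.2
        dC_dmu = (-r * diff / s2).sum(axis=1)           # eq. 4.3
        dC_dsigma = (r * (s2 - diff ** 2) / sigma[:, None] ** 3).sum(axis=1)  # eq. 4.4
        dC_dgamma = 0.5 * sigma * dC_dsigma             # chain rule through eq. 4.6
        dC_dpi = ((pi[:, None] - r) / pi[:, None]).sum(axis=1)  # eq. 4.5
        return dC_dw, dC_dmu, dC_dgamma, dC_dpi

    def init_mixture(n_components, w_max):
        # Initialization described above: means evenly spaced over [-W, W],
        # variances equal to the spacing, equal mixing proportions.
        mu = np.linspace(-w_max, w_max, n_components)
        spacing = mu[1] - mu[0]
        gamma = np.full(n_components, np.log(spacing))  # sigma_j^2 = spacing
        pi = np.full(n_components, 1.0 / n_components)
        return pi, mu, gamma

    # Toy usage on a five-weight "network".
    pi, mu, gamma = init_mixture(5, 1.0)
    w = np.array([0.7, 0.65, -0.3, 0.02, -0.01])
    dC_dw, dC_dmu, dC_dgamma, dC_dpi = penalty_gradients(w, pi, mu, gamma)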
Figure 1: Shift detection network used for generalization simulations. The output unit was connected to a bias unit and the 10 hidden units. Each hidden unit was connected to the bias unit and to 8 input units, 4 from the first block of 10 inputs and the corresponding 4 in the second block of 10 inputs. The solid, dashed, and dotted lines show the group of input units connected to the first, second, and third hidden units, respectively.

5 Results on a Toy Problem
In this section, we report on some simulations which compare the generalization performance of networks trained using the cost criterion given in equation 4.1 to networks trained in three other ways:

- No cost term to penalize complexity.
- No explicit complexity cost term, but use of a validation set to terminate learning.
- The complexity cost term used by Weigend et al. (1990).⁹

⁹With a fixed value of λ chosen by cross-validation.
The problem chosen for this comparison was a 20 input, one output shift detection network (see Fig. 1). The network had 20 input units, 10 hidden units, and a single output unit, and contained 101 weights. The first 10 input units in this network were given a random binary pattern, and the second group of 10 input units were given the same pattern circularly shifted by 1 bit left or right. The desired output of the network was +1 for a left shift and −1 for a right shift. A data set of 2400 patterns was created by randomly generating a 10 bit string, and choosing with equal probability to shift the string left or right.¹⁰ The data set was divided into 100 training cases, 1000 validation cases, and 1300 test cases. The training set was deliberately chosen to be very small (< 5% of possible patterns) to explore the region in which complexity penalties should have the largest impact.

¹⁰Since there are only 2048 distinct cases, this set of 2400 did contain some duplicates.

Networks were trained with a conjugate gradient technique and, except for the networks trained using cross-validation, training was stopped as soon as 100% correct performance was achieved on the training set. For the networks trained with cross-validation, training was stopped when three consecutive weight updates¹¹ produced an increase in the error on the validation set, and the weights were then reset to the weights which achieved the lowest error on the validation set before testing for generalization. For the technique described in Weigend et al. (1990), λ = 5.0 × 10-' in all simulations.¹² Simulations using equation 4.1 were performed with gaussian mixtures containing 5 and 10 components. Each component had its own mean (μ_j), variance (σ_j²), and mixing proportion (π_j). The parameters of the mixture distribution were continuously reestimated as the weights were changed, as was the normalizing factor for the squared error (σ_y).

¹¹A weight update refers to the final weight change accepted at the end of a single line search.
¹²This value was selected by performing simulations with λ ranging between 1.0 × 10-' and 1.0 × 10-', and choosing the value of λ that gave the best performance on the cross-validation set.

Ten simulations were performed with each method, starting from ten different initial weight sets (i.e., each method used the same ten initial weight configurations). The simulation results are summarized in Table 1. The first column indicates the method used in training the network, while the second and third columns present the performance on the training and test sets respectively (plus or minus one standard deviation).

Table 1: Summary of Generalization Performance of Five Different Training Techniques on the Shift Detection Problem.

    Method                      Train % correct    Test % correct
    Backpropagation             100.0 ± 0.0        67.3 ± 5.7
    Cross-validation            98.8 ± 1.1         83.5 ± 5.1
    Weight decay                100.0 ± 0.0        89.8 ± 3.0
    Soft-share - 5 component    100.0 ± 0.0        95.6 ± 2.7
    Soft-share - 10 component   100.0 ± 0.0        97.1 ± 2.1
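For reference, the data set just described is easy to regenerate. A minimal sketch (NumPy; the function name and seed are ours, and duplicates are allowed exactly as in the text):

    import numpy as np

    rng = np.random.default_rng(0)

    def make_shift_example(rng):
        # A random 10-bit pattern plus a 1-bit circular shift of it;
        # the target is +1 for a left shift and -1 for a right shift.
        bits = rng.integers(0, 2, size=10)
        if rng.random() < 0.5:
            shifted, target = np.roll(bits, -1), +1.0   # circular left shift
        else:
            shifted, target = np.roll(bits, 1), -1.0    # circular right shift
        return np.concatenate([bits, shifted]), target

    data = [make_shift_example(rng) for _ in range(2400)]
    train, valid, test = data[:100], data[100:1100], data[1100:]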
All three methods which employ some form of weight modeling performed significantly better on the test set than networks trained using backpropagation without a complexity penalty (p >> 0.9999).¹³ The network trained with cross-validation also performs better than a network trained without a complexity penalty (p > 0.995). The two soft-sharing models perform better than cross-validation (p > 0.995 for the 5 component mixture, p > 0.999 for the 10 component mixture). The evidence that the form of weight decay of Weigend et al. is superior to cross-validation on this problem is very weak (p < 0.9). Finally, the two soft-sharing models are significantly better than the weight decay model (p > 0.999 for the 10 component mixture and p > 0.99 for the 5 component mixture). The difference in the performance of the 5 and 10 component mixtures is not significant.

¹³All statistical comparisons are based on a t test with 19 degrees of freedom. p denotes the probability of rejecting the hypothesis that the two samples being compared have the same mean value.

A typical set of weights learned by the soft-sharing model with a 10 component mixture is shown in Figure 2 and the final mixture density is shown in Figure 3. The weight model in this case contains four primary weight clusters: large magnitude positive and negative weights and small magnitude positive and negative weights. These four distinct classes may also be seen clearly in the weight diagram (Fig. 2). What is perhaps most interesting about the mixture probability density shown in Figure 3 is that it does not have a significant component with mean 0. The classical assumption that the network contains a large number of inessential weights that can be eliminated to improve generalization is not appropriate for this problem and network architecture. This may explain why the weight decay model used by Weigend et al. (1990) performs relatively poorly in this situation.
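Comparisons of this kind are easy to reproduce in outline. A minimal sketch with made-up accuracy figures standing in for two methods' ten test-set scores; SciPy's unpaired two-sample test on two groups of ten gives 18 degrees of freedom, so the exact variant behind the quoted 19 degrees of freedom is not something this sketch tries to match:

    import numpy as np
    from scipy import stats

    # Hypothetical test-set accuracies (%) for ten runs of two methods;
    # the real values would come from the ten shared initial weight sets.
    acc_soft = np.array([96.2, 94.8, 97.5, 95.1, 96.9, 94.3, 95.8, 97.0, 96.1, 95.5])
    acc_wd = np.array([89.1, 91.2, 88.4, 90.7, 87.9, 92.0, 89.8, 90.1, 88.6, 91.4])

    # Unpaired two-sample t test (10 + 10 - 2 = 18 degrees of freedom).
    t_stat, p_two_sided = stats.ttest_ind(acc_soft, acc_wd)
    # The paper quotes p as the probability of rejecting the hypothesis of
    # equal means, roughly 1 minus the usual two-sided p-value.
    print(t_stat, 1.0 - p_two_sided)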
Figure 2: A diagram of weights discovered for the shift problem by a model which employed a 10 component mixture for the complexity cost. Black squares are negative weights and white squares are positive, with the size of the square proportional to the magnitude of the weight. Weights are shown for all 10 hidden units. The bottom row of each block represents the bias, the next two rows are the weights from the 20 input units, and the top row is the weight to the output unit.

6 Results on a Real Problem

The second task chosen to evaluate the effectiveness of the cost criterion of equation 4.1 was the prediction of the yearly sunspot average from the averages of previous years. This task has been well studied as a time-series prediction benchmark in the statistics literature (Priestley 1991a,b) and has also been investigated by Weigend et al. (1990) using a cost criterion similar to the one discussed in Section 2. The network architecture used was identical to the one used in the study by Weigend et al. The network had 12 input units, which represented the yearly averages from the preceding 12 years, 8 hidden units, and a single output unit, which represented the prediction for the average number of sunspots in the current year. Yearly sunspot data from 1700 to 1920 was used to train the network to perform this one-step prediction task, and the evaluation of the network was based on data from 1921 to 1955.¹⁴

¹⁴The authors wish to thank Andreas Weigend for providing his version of this data to work with.

The evaluation of prediction performance used the average relative variance (arv) measure discussed in Weigend et al. (1990):
arv(S) = Σ_{k∈S} (target_k − prediction_k)² / Σ_{k∈S} (target_k − mean_S)²    (6.1)

where S is a set of target values and mean_S is the average of those target values.
Simulations were performed using the same conjugate gradient method used in the previous section. Complexity measures based on gaussian mixtures with 3 and 8 components were used and 10 simulations were performed with each (using the same training data but different initial weight configurations). The results of these simulations are summarized in Table 2, along with the best result obtained by Weigend et al. (1990) (WRH), the threshold autoregressive model of Tong and Lim (1980)
(TAR),¹⁵ and the multilayer RBF network of He and Lapedes (1991) (RBF). All figures represent the arv on the test set. For the mixture complexity models, this is the average over the 10 simulations, plus or minus one standard deviation. Since the results for the models other than the mixture complexity trained networks are based on a single simulation, it is difficult to assign statistical significance to the differences shown in Table 2. We may note, however, that the difference between the 3 and 8 component mixture complexity models is significant (p > 0.95) and the differences between the 8 component model and the other models are much larger.

¹⁵This was the model favored by Priestley (1991a) in a recent evaluation of classical statistical approaches to this task.

Weigend et al. point out that for time series prediction tasks such as the sunspot task, a much more interesting measure of performance is the ability of the model to predict more than one time step into the future.

Figure 3: Final mixture probability density for the set of weights shown in Figure 2. Five of the components in the mixture can be seen as distinct bumps in the probability density. Of the remaining five components, two have been eliminated by having their mixing proportions go to zero and the other three are very broad and form the baseline offset of the density function.
Table 2: Summary of average relative variance of five different models on the one-step sunspot prediction problem.

    Method                  Test arv
    TAR                     0.097
    RBF                     0.092
    WRH                     0.086
    Soft-share - 3 comp.    0.077 ± 0.0029
    Soft-share - 8 comp.    0.072 ± 0.0022
One way to approach the multistep prediction problem is to use iterated single-step prediction. In this method, the predicted output is fed back as input for the next prediction and all other input units have their values shifted back one unit. Thus the input typically consists of a combination of actual and predicted values. We define the predicted value for time t, obtained after I iterations, to be x̂_{t,I}. The prediction error will depend not only on I but also on the time (t − I) when the iteration was started. In order to account for both effects, Weigend et al. suggested the average relative I-times iterated prediction variance as a performance measure for iterated prediction:
arv(I) = (1/(M σ̂²)) Σ_{m=1}^{M} (target_{t_m} − x̂_{t_m,I})²    (6.2)

where M is the number of different start times for iterated prediction and σ̂ is the estimated standard deviation of the set of target values. In Figure 4 we plot this measure (computed over the test set from 1921 to 1955) as a function of the number of prediction iterations for the simulations using the 3 and 8 component complexity measures, the Tong and Lim model (TAR), and the model from Weigend et al. which produced the lowest single-step arv (WRH). The plots for the 3 and 8 component complexity models are the averages over 10 simulations, with the error bars indicating the plus or minus one standard deviation intervals. Once again, the differences between the 3 and 8 component models are significant for all numbers of iterations.

The differences between the adaptive gaussian complexity measure and the fixed complexity measure used by Weigend et al. are not as dramatic on the sunspot task as they were in the shift detection task. The explanation for this may be seen in Figures 5 and 6, which show a typical set of weights learned by the soft-sharing model with 8 mixture components and the corresponding final mixture probability density. The distinct weight groups seen clearly in the shift detection task (Fig. 2) are not as apparent in the weights for the sunspot task, and the final weight distribution for the sunspot task is very smeared out except for one very strong sharp component near 0.
Figure 4: Average relative I-times iterated prediction variance versus number of prediction iterations for the sunspot time series from 1921 to 1955. Closed circles represent the TAR model, open circles the WRH model, closed squares the 3 component complexity model, and open squares the 8 component complexity model. One-standard-deviation error bars are shown for the 3 and 8 component complexity models.
It is clear that the fixed model assumed by Weigend et al. is much more appropriate for the sunspot prediction task than it was for the shift detection task.
7 A Minimum Description Length Perspective
A number of authors have suggested that when attempting to approximate an unknown function with some parametric approximation scheme (such as a network), the proper measure to optimize combines an estimate of the cost of the misfit with an estimate of the cost of describing the parametric approximation (Akaike 1973; Rissanen 1978; Barron and Barron 1988). Such a measure is often referred to as a minimum description
Figure 5 A diagram of weights discovered for the sunspot prediction problem by a model which employed an 8 component mixture for the complexity cost. Weights are shown for all 8 hidden units. For each unit, the weights coming from the 12 inputs are shown in a row with the single weight to the output immediately above the row. The biases of the hidden units, which are not shown, were, with one exception, small negative numbers very close in value to most of the other weights in the network. The first three units in the left column all represent the simple rule that the number of sunspots depends on the number in the previous year. The last two units in this column compute a simple moving average. The three units on the right represent more interesting rules. The first captures the 11 year cycle, the second recognizes when a peak has just passed, and third appears to prevent the prediction from rising too soon if a peak happened 9 years ago and the recent activity is low.
length criterion (MDL), and typically has the general form

    \mathrm{MDL} = \sum_{\mathrm{messages}} -\log p(\mathrm{message}) \, + \sum_{\mathrm{parameters}} -\log p(\mathrm{parameter})
For a supervised network, the parameters are the weights and the messages are the desired outputs. If we assume that the output errors are gaussian and that the weights are encoded using a mixture of gaussians probability model, the description length is approximated by equation 4.1. The expression in equation 4.1 does not include the cost of encoding the means and variances of the mixture components or the mixing proportions of the mixture density. Since the mixture usually contains a small number of components (usually fewer than 10) and there are only three parameters associated with each component, the cost of encoding these parameters is negligible compared to the cost of encoding the weights in most networks of interest.¹⁶ In addition, since the number of components in the distribution does not change during the optimization, if the component parameters are all encoded with the same fixed precision, the cost of the mixture parameters is simply a constant offset, which is ignored in the optimization.¹⁷

¹⁶In order to provide enough data to fit the mixture density, one should have an order of magnitude more weights than components in the mixture.
¹⁷This ignores the possibility of not encoding the parameters of components whose mixing proportions approach 0.
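Equation 4.1 itself appears earlier in the paper and is not reproduced here, but the two-part cost it expresses is easy to illustrate. The sketch below (our own naming, not the paper's software) combines the misfit cost of gaussian output errors with the coding cost of the weights under a mixture-of-gaussians density:

    import numpy as np

    def description_length(errors, weights, pi, mu, sigma, noise_sigma):
        # Data-misfit cost: -log p(errors) under a gaussian error model
        # (additive constants dropped, as they do not affect optimization).
        misfit = 0.5 * np.sum((errors / noise_sigma) ** 2)
        # Weight-coding cost: -sum_i log sum_j pi_j N(w_i; mu_j, sigma_j).
        w = weights[:, None]                   # shape (n_weights, 1)
        comp = pi * np.exp(-0.5 * ((w - mu) / sigma) ** 2) \
               / (np.sqrt(2.0 * np.pi) * sigma)  # shape (n_weights, n_components)
        coding = -np.sum(np.log(comp.sum(axis=1)))
        return misfit + coding

As the text notes, the cost of encoding the mixture parameters themselves (pi, mu, sigma) is omitted as negligible.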
Figure 6: Final mixture probability density for the set of weights shown in Figure 5. The density is dominated by a narrow component centered very near zero, with the remaining components blending into a skewed distribution with a peak around 0.5.

There is one important aspect of estimating the cost of describing a weight that we have ignored. We have assumed that the cost of a weight is the negative logarithm of a probability density function evaluated at the weight value, but this ignores the accuracy with which the weight must be described. We are really interested in the probability mass of a particular small interval of values for the parameter, and this means that we should integrate our density function over this interval to estimate the cost of each weight. We have implicitly assumed that this integration
region has the same (infinitesimal) width for every weight, and so the probability of a weight is simply proportional to the density function. This ignores the fact that most networks are generally much more sensitive to small changes in some weight values than to others, so some weights need to be encoded more accurately than others.¹⁸ The sensitivity of a network to a small change in a weight is determined by the curvature of the error surface. One could evaluate the curvature by computing the Hessian and make the width of the integration region for each weight inversely proportional to the curvature along each weight dimension. To be perfectly accurate, one would need to integrate the joint probability density function for all of the weights over a region determined by the complete Hessian (since the directions of maximum curvature are often not perfectly aligned with the weight axes). This process would be computationally very costly, and an adequate approximation might be obtained by using a diagonal approximation to the Hessian and treating each weight independently (as advocated by Le Cun et al. 1990). We see no reason why our method of estimating the probability density should not be combined with a method for estimating the integration interval. For small intervals, this is particularly easy, since the probability mass is approximately the width of the interval times the height of the density function, so these two terms are additive in the log probability domain.

¹⁸Other things being equal, we should prefer networks in which the outputs are less sensitive to the precise weight values, since then the weight values can be encoded imprecisely without causing large output errors.
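A minimal sketch of this refinement (our own construction, not code from the paper): approximate the probability mass of a weight as density times interval width, with the width set from the diagonal of the Hessian:

    import numpy as np

    def weight_coding_cost(w, density, h_ii, scale=1.0):
        # Cost of one weight: -log[ p(w) * delta_w ], where the interval
        # width delta_w is taken inversely proportional to the curvature
        # h_ii along that weight dimension (diagonal Hessian approximation;
        # assumes h_ii > 0). `density` is any callable returning p(w).
        delta_w = scale / h_ii
        return -np.log(density(w)) - np.log(delta_w)

Because the mass is approximated as width times height, the two terms are additive in the log domain, exactly as noted above.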
8 A Bayesian Perspective

As a number of authors have recently pointed out (Weigend et al. 1991; Nowlan 1991; MacKay 1991; Buntine and Weigend 1991), equation 1.1 can be derived from the principles of Bayesian inference. If we have a set of models M_1, M_2, ..., which are competing to account for the data, Bayesian inference is concerned with how we should update our belief in the relative plausibility of each of these models in light of the data D. If P(M_i | D) is the plausibility of model M_i given we have observed D, Bayes' rule states

    P(M_i \mid D) = \frac{P(D \mid M_i) \, P(M_i)}{P(D)}    (8.1)
where P(D | M_i) is a measure of how well model i predicts the data, and P(M_i) is our belief in the plausibility of model i before we have seen any data. Here P(D) is simply a normalizing factor that ensures our beliefs add up to one. If we are only interested in comparing alternative models, P(D) can be ignored, and in the log domain equation 8.1 becomes equation 1.1, with the data-misfit cost equal to log P(D | M_i) and the complexity
cost equal to log P(M_i). If we are only considering a single network architecture, P(M_i) becomes a prior distribution over the set of possible weights.¹⁹ What equation 8.1 highlights is that in the Bayesian framework our complexity cost should be independent of our data. This is certainly true when the complexity is the sum of the squares of the weights, and it also holds for models such as the one used by Weigend et al. However, the mixture densities discussed in this paper are clearly not independent of the data and cannot be regarded as classical Bayesian priors. The complexity cost we are using corresponds more closely to a Bayesian hyperprior (Jaynes 1986; Gull 1988). We have specified a particular family of distributions from which the prior will be drawn but have left the parameters of the prior (\pi_j, \mu_j, \sigma_j) undetermined. Members of this family of distributions have the common feature of favoring sets of weights in which the weights in a set are clustered about a small number of values.²⁰ When using a hyperprior, we can deal with the hyperparameters either by marginalizing over them (in effect, integrating them out) (Buntine and Weigend 1991), or by allowing the data (i.e., the weights) to determine their values a posteriori.²¹ We have used this second approach, which is advocated by Gull (1988), who has shown that the use of such flexible hyperpriors can lead to considerable improvement in the quality of image reconstructions (Gull 1989; Skilling 1989) compared to the use of more classical priors. The trick of optimizing \tau_j rather than \sigma_j (discussed at the end of Section 4) may also be justified within the Bayesian framework. To estimate our hyperparameters, we should properly specify prior distributions for each. If these priors are uninformative,²² then the estimated values of the hyperparameters are determined entirely by the data. A parameter like \sigma_j is known as a scale parameter (it affects the width of the distribution), while parameters like \mu_j are known as location parameters (they affect the position of the distribution). (See Jeffreys 1939 for further discussion.) An uninformative prior for a location parameter is uniform in the parameter, but an uninformative prior for a scale parameter is uniform in the log of the parameter (i.e., uniform in \tau_j rather than \sigma_j; Gull 1988). It is more consistent from this perspective to treat \tau_j and \mu_j similarly, rather than \sigma_j and \mu_j.
¹⁹Much more interesting results are obtained when we apply this framework to making choices among many architectures; see MacKay (1991) for some elegant examples.
²⁰The locations of these clusters will generally be different for different sets of weights.
²¹In principle, both approaches will lead to the same posterior distribution over the weights and the same ultimate choice of weights for the network. The difference lies in whether we are searching over a joint space of weights and hyperparameters or using prior analytic simplifications to reduce the search to some manifold in weight space alone.
²²A prior that contains no initial bias except for a possible range constraint.
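In practice the scale-parameter observation amounts to a change of variables in the optimizer. A minimal sketch (our own illustration, not code from the paper):

    import numpy as np

    # Optimize tau = log(sigma) instead of sigma itself: sigma = exp(tau)
    # is guaranteed positive, a prior uniform in tau is uniform in log sigma
    # (the uninformative prior for a scale parameter), and the gradient
    # follows from the chain rule: d/dtau = sigma * d/dsigma.
    def grad_wrt_tau(grad_wrt_sigma, tau):
        sigma = np.exp(tau)
        return sigma * grad_wrt_sigma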
9 Summary
The simulations we have described provide evidence that the use of a more sophisticated model for the distribution of weights in a network can lead to better generalization performance than a simpler form of weight decay, or techniques that control the learning time. The better generalization performance comes at the cost of greater complexity in the optimization of the weights. The effectiveness of the technique is likely to be somewhat problem dependent, but one advantage offered by the more sophisticated model is its ability to automatically adapt the model of the weight distribution to individual problems.
Acknowledgments

This research was funded by grants from the Ontario Information Technology Research Center, the Canadian Natural Science and Engineering Research Council, and the Howard Hughes Medical Institute. Hinton is the Noranda fellow of the Canadian Institute for Advanced Research.
References

Akaike, H. 1973. Information theory and an extension of the maximum likelihood principle. In Proceedings 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, eds., pp. 267-281. Akademiai Kiado, Budapest, Hungary.
Barron, A. R., and Barron, R. L. 1988. Statistical learning networks: A unifying view. In 1988 Symposium on the Interface: Statistics and Computing Science, Reston, Virginia, April 21-23.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Buntine, W. L., and Weigend, A. S. 1991. Bayesian back-propagation. Complex Systems 5(6), 603-643.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39, 1-38.
Gull, S. F. 1988. Bayesian inductive inference and maximum entropy. In Maximum Entropy and Bayesian Methods in Science and Engineering, G. J. Erickson and C. R. Smith, eds. Kluwer Academic, Dordrecht, The Netherlands.
Gull, S. F. 1989. Developments in maximum entropy data analysis. In Maximum Entropy and Bayesian Methods (8th Workshop), J. Skilling, ed., pp. 53-71. Kluwer Academic, Dordrecht, The Netherlands.
He, X., and Lapedes, A. 1991. Nonlinear Modelling and Prediction by Successive Approximation Using Radial Basis Functions. Tech. Rep. LA-UR-91-1375, Los Alamos National Laboratory.
Hinton, G. E. 1986. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 1-12. Erlbaum, Hillsdale, NJ.
Hinton, G. E. 1987. Learning translation invariant recognition in a massively parallel network. In Proc. Conf. Parallel Architectures and Languages Europe, pp. 1-13. Eindhoven, The Netherlands.
Jaynes, E. T. 1986. Bayesian methods: General background. In Maximum Entropy and Bayesian Methods in Applied Statistics, J. H. Justice, ed., pp. 1-25. Cambridge University Press, Cambridge.
Jeffreys, H. 1939. Theory of Probability. Oxford University Press, Oxford. Later editions 1948, 1961, 1983.
Kohonen, T. 1977. Associative Memory: A System-Theoretical Approach. Springer, Berlin.
Lang, K. J., Waibel, A. H., and Hinton, G. E. 1990. A time-delay neural network architecture for isolated word recognition. Neural Networks 3, 23-43.
LeCun, Y. 1987. Modeles connexionnistes de l'apprentissage. Ph.D. thesis, Universite Pierre et Marie Curie, Paris, France.
LeCun, Y. 1989. Generalization and Network Design Strategies. Tech. Rep. CRG-TR-89-4, University of Toronto.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1990. Handwritten digit recognition with a backpropagation network. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 396-404. Morgan Kaufmann, San Mateo, CA.
LeCun, Y., Denker, J., Solla, S., Howard, R. E., and Jackel, L. D. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 598-605. Morgan Kaufmann, San Mateo, CA.
MacKay, D. J. C. 1991. Bayesian modeling and neural networks. Ph.D. thesis, Computation and Neural Systems, California Institute of Technology, Pasadena, CA.
McLachlan, G. J., and Basford, K. E. 1988. Chapters 1 and 2. In Mixture Models: Inference and Applications to Clustering, pp. 1-69. Marcel Dekker, New York.
Morgan, N., and Bourlard, H. 1989. Generalization and Parameter Estimation in Feedforward Nets: Some Experiments. Tech. Rep. TR-89-017, International Computer Science Institute, Berkeley, CA.
Mozer, M. C., and Smolensky, P. 1989. Using relevance to reduce network size automatically. Connection Sci. 1(1), 3-16.
Nowlan, S. J. 1991. Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Plaut, D. C., Nowlan, S. J., and Hinton, G. E. 1986. Experiments on Learning by Backpropagation. Tech. Rep. CMU-CS-86-126, Carnegie Mellon University, Pittsburgh, PA.
Priestley, M. B. 1991a. Non-linear and Non-stationary Time Series Analysis. Academic Press, San Diego.
Priestley, M. B. 1991b. Spectral Analysis and Time Series. Academic Press, San Diego.
Rissanen, J. 1978. Modeling by shortest data description. Automatica 14, 465-471.
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vols. I and II. MIT Press, Cambridge, MA.
Skilling, J. 1989. Classic maximum entropy. In Maximum Entropy and Bayesian Methods (8th Workshop), J. Skilling, ed. Kluwer Academic, Dordrecht, The Netherlands.
Tong, H., and Lim, K. S. 1980. Threshold autoregression, limit cycles, and cyclical data. J. R. Stat. Soc. Ser. B 42, 245-253.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1990. Predicting the future: A connectionist approach. In Proceedings of the 1990 Connectionist Models Summer School, T. J. Sejnowski, G. E. Hinton, and D. S. Touretzky, eds., pp. 105-116. Morgan Kaufmann, San Mateo, CA.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 875-882. Morgan Kaufmann, San Mateo, CA.
Received 15 October 1991; accepted 2 December 1991.
Communicated by David MacKay
NOTE
Exact Calculation of the Hessian Matrix for the Multilayer Perceptron

Chris Bishop
Neural Networks Group, AEA Technology, Harwell Laboratory, Oxfordshire OX11 0RA, United Kingdom
The elements of the Hessian matrix consist of the second derivatives of the error measure with respect to the weights and thresholds in the network. They are needed in Bayesian estimation of network regularization parameters, for estimation of error bars on the network outputs, for network pruning algorithms, and for fast retraining of the network following a small change in the training data. In this paper we present an extended backpropagation algorithm that allows all elements of the Hessian matrix to be evaluated exactly for a feedforward network of arbitrary topology. Software implementation of the algorithm is straightforward.

1 Introduction
Standard training algorithms for the multilayer perceptron use backpropagation to evaluate the first derivatives of the error function with respect to the weights and thresholds in the network. There are, however, several situations in which it is also of interest to evaluate the second derivatives of the error measure. These derivatives form the elements of the Hessian matrix. Second derivative information has been used to provide a fast procedure for retraining a network following a small change in the training data (Bishop 1991). In this application it is important that all elements of the Hessian matrix be evaluated accurately. Approximations to the Hessian have been used to identify the least significant weights as a basis for network pruning techniques (Le Cun et al. 1990), as well as for improving the speed of training algorithms (Becker and Le Cun 1988; Ricotta et al. 1988). The Hessian has also been used by MacKay (1991) for Bayesian estimation of regularization parameters, as well as for calculation of error bars on the network outputs and for assigning probabilities to different network solutions. MacKay found that the approximation scheme of Le Cun et al. (1990) was not sufficiently accurate and therefore included off-diagonal terms in the approximation scheme.

Neural Computation 4, 494-501 (1992)
© 1992 Massachusetts Institute of Technology
In this paper we show that the elements of the Hessian matrix can be evaluated exactly using multiple forward propagations through the network, followed by multiple backward propagations. The resulting algorithm is closely related to a technique for training networks whose error functions contain derivative terms (Bishop 1990). In Section 2 we derive the algorithm for a network of arbitrary feedforward topology, in a form that can readily be implemented in software. The algorithm simplifies somewhat for a network having a single hidden layer, and this case is described in Section 3. Finally, a brief summary is given in Section 4.

2 Evaluation of the Hessian Matrix
Consider a feedforward network in which the activation z_i of the ith unit is a nonlinear function of the input to the unit:

    z_i = f(a_i)    (2.1)

in which the input a_i is given by a weighted linear sum of the outputs of other units:

    a_i = \sum_j w_{ij} z_j + \theta_i    (2.2)

where w_{ij} is the synaptic weight from unit j to unit i, and \theta_i is a bias associated with unit i. Since the bias terms can be considered as weights from an extra unit whose activation is fixed at z_k = +1, we can simplify the notation by absorbing the bias terms into the weight matrix, without loss of generality. We wish to find the first and second derivatives of an error function E, which we take to consist of a sum of terms, one for each pattern in the training set,

    E = \sum_{p=1}^{P} E_p    (2.3)
where p labels the pattern. The derivatives of E are obtained by summing the derivatives obtained for each pattern separately. To evaluate the elements of the Hessian matrix, we note that the units in a feedforward network can always be arranged in "layers," or levels, for which there are no intralayer connections and no feedback connections. Consider the case in which unit i is in the same layer as unit n, or in a lower layer (i.e., one nearer the input). The remaining terms, in which unit i is above unit n, can be obtained from the symmetry of the Hessian matrix without further calculation. We first write

    \frac{\partial^2 E_p}{\partial w_{ij} \, \partial w_{nl}} = z_j \frac{\partial}{\partial a_i} \left( \frac{\partial E_p}{\partial w_{nl}} \right) = z_j \frac{\partial}{\partial a_i} \left( z_l \frac{\partial E_p}{\partial a_n} \right)    (2.4)
where we have made use of equation 2.2. The first equality in equation 2.4 follows from the fact that, as we shall see later, the first derivative \partial E_p / \partial w_{nl} depends on w_{ij} only through a_i. We now introduce a set of quantities defined by

    \sigma_n \equiv \frac{\partial E_p}{\partial a_n}    (2.5)

Note that these are the quantities that are used in standard backpropagation. The appropriate expressions for evaluating them will be obtained shortly. Equation 2.4 then becomes

    \frac{\partial^2 E_p}{\partial w_{ij} \, \partial w_{nl}} = z_j \frac{\partial}{\partial a_i} \left( z_l \sigma_n \right)    (2.6)
where again we have used equation 2.2. We next define the quantities

    g_{li} \equiv \frac{\partial a_l}{\partial a_i}    (2.7)

    b_{ni} \equiv \frac{\partial \sigma_n}{\partial a_i}    (2.8)

The second derivatives can now be written in the form

    \frac{\partial^2 E_p}{\partial w_{ij} \, \partial w_{nl}} = z_j z_l \, b_{ni} + z_j \sigma_n f'(a_l) \, g_{li}    (2.9)
where f'(a) denotes df/da. The {g_{li}} can be evaluated from a forward propagation equation obtained as follows. Using the chain rule for partial derivatives, we have

    g_{li} = \sum_r \frac{\partial a_l}{\partial a_r} \frac{\partial a_r}{\partial a_i}    (2.10)

where the sum runs over all units r that send connections to unit l. (In fact, contributions only arise from units that lie on paths connecting unit i to unit l.) Using equations 2.1 and 2.2 we then obtain the forward propagation equation

    g_{li} = \sum_r w_{lr} f'(a_r) \, g_{ri}    (2.11)
The initial conditions for evaluating the {g_{li}} follow from the definition in equation 2.7, and can be stated as follows. For each unit i in the network (except for input units, for which the corresponding {g_{li}} are not required), set g_{ii} = 1 and set g_{li} = 0 for all units l ≠ i that are in the same layer as unit i or in a layer below the layer containing unit i. The remaining elements of g_{li} can then be found by forward propagation using equation 2.11. The number of forward passes needed to evaluate all elements of {g_{li}} will depend on the network topology, but will typically scale like the number of (hidden plus output) units in the network.
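In code, the forward pass of equation 2.11 for a fixed unit i might look like the following sketch (our own construction; `topo_order_above` and `inputs_of` are hypothetical helpers exposing the network topology):

    def forward_g(i, a, W, f_prime, topo_order_above, inputs_of):
        # g[l] accumulates g_li = da_l/da_i for all units l above unit i;
        # initial conditions: g_ii = 1 and g_li = 0 for units not above i.
        g = {i: 1.0}
        for l in topo_order_above(i):
            g[l] = sum(W[l][r] * f_prime(a[r]) * g.get(r, 0.0)
                       for r in inputs_of(l))    # equation 2.11
        return g

One such pass is needed per unit i, which is the source of the unit-count scaling noted above.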
The quantities {\sigma_n} can be obtained from the following backpropagation procedure. Using the definition in equation 2.5, together with the chain rule, we can write

    \sigma_n = \sum_r \frac{\partial E_p}{\partial a_r} \frac{\partial a_r}{\partial a_n}    (2.12)

where the sum runs over all units r to which unit n sends connections. Using equations 2.1 and 2.2 then gives

    \sigma_n = f'(a_n) \sum_r w_{rn} \sigma_r    (2.13)
This is just the familiar backpropagation equation. Note that the first derivatives of the error function are given by the standard expression
    \frac{\partial E_p}{\partial w_{ij}} = \sigma_i z_j    (2.14)
which follows from equations 2.2 and 2.5. The initial conditions for evaluation of the {\sigma_n} are given, from equations 2.1 and 2.5, by

    \sigma_m = f'(a_m) \frac{\partial E_p}{\partial z_m}    (2.15)
where m labels an output unit. Similarly, we can derive a generalized backpropagation equation that allows the {b_{ni}} to be evaluated. Substituting the backpropagation formula 2.13 for the {\sigma_n} into the definition of b_{ni}, equation 2.8, we obtain

    b_{ni} = \frac{\partial}{\partial a_i} \left[ f'(a_n) \sum_r w_{rn} \sigma_r \right]    (2.16)

which, using equations 2.7 and 2.8, gives

    b_{ni} = f''(a_n) \, g_{ni} \sum_r w_{rn} \sigma_r + f'(a_n) \sum_r w_{rn} b_{ri}    (2.17)
where again the sum runs over all units r to which unit n sends connections. Note that, in a software implementation, the first summation in equation 2.17 will already have been computed in evaluating the {\sigma_n} in equation 2.13. The derivative \partial / \partial a_i that appears in equation 2.16 arose from the derivative \partial / \partial w_{ij} in equation 2.4. This transformation, from w_{ij} to a_i, is valid provided w_{ij} does not appear explicitly within the brackets on the right-hand side of equation 2.16. This is always the case, because we considered only units i in the same layer as unit n, or in a lower layer. Thus the weights w_{rn} are always above the weight w_{ij}, and so the term \partial w_{rn} / \partial w_{ij} is always zero.
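The \sigma recursion of equations 2.13 and 2.15 is the familiar backward pass; a minimal sketch (our own, with hypothetical topology helpers `topo_order_below_outputs` and `consumers_of`):

    def backward_sigma(a, W, f_prime, dE_dz, output_units,
                       topo_order_below_outputs, consumers_of):
        # Initial conditions at the output units (equation 2.15), then the
        # standard backpropagation recursion (equation 2.13).
        sigma = {m: f_prime(a[m]) * dE_dz[m] for m in output_units}
        for n in topo_order_below_outputs():   # highest non-output layer first
            sigma[n] = f_prime(a[n]) * sum(W[r][n] * sigma[r]
                                           for r in consumers_of(n))
        return sigma

The {b_{ni}} follow the same traversal order, using equation 2.17 in place of equation 2.13.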
The initial conditions for the backpropagation in equation 2.17 follow from equations 2.7, 2.8, and 2.15,

    b_{mi} = g_{mi} H_m    (2.18)

where we have defined

    H_m \equiv f''(a_m) \frac{\partial E_p}{\partial z_m} + \left[ f'(a_m) \right]^2 \frac{\partial^2 E_p}{\partial z_m^2}    (2.19)
Thus, for each unit i (except for the input units), the b_{mi} corresponding to each output unit m are evaluated using equations 2.18 and 2.19, and then the b_{ni} for each remaining unit n (except for the input units, and units n that are in a lower layer than unit i) are found by backpropagation using equation 2.17. Before using the above equations in a software implementation, the appropriate expressions for the derivatives of the activation function should be substituted. For instance, if the activation function is given by the sigmoid

    f(a) = \frac{1}{1 + \exp(-a)}    (2.20)
then the first and second derivatives are given by

    f'(a) = f(a)\left[1 - f(a)\right], \qquad f''(a) = f'(a)\left[1 - 2f(a)\right]    (2.21)

For the case of linear output units, we have f(a) = a, f'(a) = 1, and f''(a) = 0, with corresponding simplification of the relevant equations. Similarly, appropriate expressions for the derivatives of the error function with respect to the output unit activations should be substituted into equations 2.15 and 2.19. Thus, for the sum-of-squares error defined by

    E_p = \frac{1}{2} \sum_m (z_m - t_m)^2    (2.22)

where t_m is the target value for output unit m, the required derivatives of the error become

    \frac{\partial E_p}{\partial z_m} = z_m - t_m, \qquad \frac{\partial^2 E_p}{\partial z_m^2} = 1    (2.23)

Another commonly used error measure is the relative entropy (Solla et al. 1988) defined by
    E_p = \sum_m \left\{ t_m \ln z_m + (1 - t_m) \ln(1 - z_m) \right\}    (2.24)
The derivatives of E_p take a particularly elegant form when the activation function of the output units is given by the sigmoid of equation 2.20. In this case we have, from equations 2.15, 2.19, and 2.21,

    \sigma_m = t_m - z_m, \qquad H_m = -z_m(1 - z_m)
To summarize, the evaluation of the terms in the Hessian matrix can be broken down into three stages. For each pattern p, the {z_n} are calculated by forward propagation using equations 2.1 and 2.2, and the {g_{li}} are obtained by forward propagation using equation 2.11. Next, the {\sigma_n} are found by backpropagation using equations 2.13 and 2.15, and the {b_{ni}} are found by backpropagation using equations 2.17, 2.18, and 2.19. Finally, the second derivative terms are evaluated using equation 2.9. (If one or both of the weights is a bias, then the correct expression is obtained simply by setting the corresponding activation(s) to +1.) These steps are repeated for each pattern in the training set, and the results summed to give the elements of the Hessian matrix. The total number of distinct forward and backward propagations required (per training pattern) scales like the number of (hidden plus output) units in the network, with the number of operations for each propagation scaling like N, where N is the total number of weights in the network. Evaluation of the elements of the Hessian using equation 2.9 requires of order N² operations. Since the number of weights is typically much larger than the number of units, the overall computation will be dominated by the evaluations in equation 2.9.

3 Single Hidden Layer

Many applications of feedforward networks make use of an architecture having a single layer of hidden units, with full interconnections between adjacent layers, and no direct connections from input units to output units. Since there is some simplification to the algorithm for such a network, we present here the explicit expressions for the second derivatives. These follow directly from the equations given in Section 2. We shall use indices k and k' for units in the input layer, indices l and l' for units in the hidden layer, and indices m and m' for units in the output layer. The Hessian matrix for this network can be considered in three separate blocks as follows.

(A) Both weights in the second layer:

    \frac{\partial^2 E_p}{\partial w_{ml} \, \partial w_{m'l'}} = z_l z_{l'} \, \delta_{mm'} H_m    (3.1)
(B) Both weights in the first layer:

    \frac{\partial^2 E_p}{\partial w_{lk} \, \partial w_{l'k'}} = z_k z_{k'} \left[ f''(a_l) \, \delta_{ll'} \sum_m w_{ml} \sigma_m + f'(a_l) f'(a_{l'}) \sum_m w_{ml} w_{ml'} H_m \right]    (3.2)
(C) One weight in each layer:

    \frac{\partial^2 E_p}{\partial w_{lk} \, \partial w_{ml'}} = z_k f'(a_l) \left[ \delta_{ll'} \sigma_m + z_{l'} w_{ml} H_m \right]    (3.3)

where H_m is defined by equation 2.19. If one or both of the weights is a bias term, then the corresponding expressions are obtained simply by setting the appropriate unit activation(s) to +1.
4 Summary

In this paper, we have derived a general algorithm for the exact evaluation of the second derivatives of the error function, for a network having arbitrary feedforward topology. The algorithm involves successive forward and backward propagations through the network, and is expressed in a form that allows for straightforward implementation in software. The number of forward and backward propagations, per training pattern, is at most equal to twice the number of (hidden plus output) units in the network, while the total number of multiplications and additions scales like the square of the number of weights in the network. For networks having a single hidden layer, the algorithm can be expressed in a particularly simple form. Results from a software simulation of this algorithm, applied to the problem of fast network retraining, are described in Bishop (1991).
References

Becker, S., and LeCun, Y. 1988. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the Connectionist Models Summer School, D. S. Touretzky, G. E. Hinton, and T. J. Sejnowski, eds., p. 29. Morgan Kaufmann, San Mateo, CA.
Bishop, C. M. 1990. Curvature-driven smoothing in feedforward networks. In Proceedings of the International Neural Network Conference, Paris, Vol. 2, p. 749. To be published in IEEE Transactions on Neural Networks.
Bishop, C. M. 1991. A fast procedure for re-training the multilayer perceptron. Int. J. Neural Syst. 2(3), 229-236.
LeCun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems, Vol. 2, D. S. Touretzky, ed., p. 598. Morgan Kaufmann, San Mateo, CA.
MacKay, D. J. C. 1991. A practical Bayesian framework for backprop networks. Neural Comp. 4(3), 448-472.
Ricotta, L. P., Ragazzini, S., and Martinelli, G. 1988. Learning of word stress in a suboptimal second order backpropagation neural network. In Proceedings IEEE International Conference on Neural Networks, San Diego, Vol. 1, p. 355.
Solla, S. A., Levin, E., and Fleisher, M. 1988. Accelerated learning in layered neural networks. Complex Syst. 2, 625-640.
Received 5 August 1991; accepted 2 October 1991.
Communicated by Thomas Brown
NMDA-Based Pattern Discrimination in a Modeled Cortical Neuron

Bartlett W. Mel
Computation and Neural Systems Program, Division of Biology, 216-76, California Institute of Technology, Pasadena, CA 91125 USA
Compartmental simulations of an anatomically characterized cortical pyramidal cell were carried out to study the integrative behavior of a complex dendritic tree. Previous theoretical (Feldman and Ballard 1982; Durbin and Rumelhart 1989; Mel 1990; Mel and Koch 1990; Poggio and Girosi 1990) and compartmental modeling (Koch et al. 1983; Shepherd et al. 1985; Koch and Poggio 1987; Rall and Segev 1987; Shepherd and Brayton 1987; Shepherd et al. 1989; Brown et al. 1991) work had suggested that multiplicative interactions among groups of neighboring synapses could greatly enhance the processing power of a neuron relative to a unit with only a single global firing threshold. This issue was investigated here, with a particular focus on the role of voltage-dependent N-methyl-D-aspartate (NMDA) channels in the generation of cell responses. First, it was found that when a large proportion of the excitatory synaptic input to dendritic spines is carried by NMDA channels, the pyramidal cell responds preferentially to spatially clustered, rather than random, distributions of activated synapses. Second, based on this mechanism, the NMDA-rich neuron is shown to be capable of solving a nonlinear pattern discrimination task. We propose that manipulation of the spatial ordering of afferent synaptic connections onto the dendritic arbor is a possible biological strategy for pattern information storage during learning.

1 Introduction
The cerebral neocortex and its connections account for almost 90% of the human brain by weight (Hofman 1989). Our understanding of its functions can be significantly advanced by an accurate input-output model for the individual pyramidal cell, the principal and most numerous neocortical cell type. How does an individual cortical pyramidal cell integrate its synaptic inputs over space and time? Which spatiotemporal patterns of synaptic activation in the dendritic tree lead to the firing of action potentials, and which do not? In short, what kind of "device" is the cortical pyramidal cell?

Neural Computation 4, 502-517 (1992)
The response of a pyramidal neuron depends on a large number of variables, such as the detailed anatomy of the dendritic tree, numerous biophysical cell parameters, the number and distribution of synaptic inputs to the dendritic tree and their firing frequencies, and the strengths and time courses of the several excitatory and inhibitory synaptic conductances that are known to exist in these cells. The direct study of the input-output behavior of a dendritic tree under conditions of varying synaptic stimulation is not currently possible in the laboratory for technical reasons. The primary tool currently available for the study of complex dendritic trees is compartmental modeling, used to compute the time course of currents, voltages, and conductances in an electrical circuit model of an arbitrary neuron (Rall 1964; Perkel and Mulloney 1978; Traub 1982; Koch et al. 1983; Shepherd et al. 1985; Koch and Poggio 1987; Rall and Segev 1987; Shepherd and Brayton 1987; Shepherd et al. 1989; Brown et al. 1991). Our initial hypothesis as to the integrative behavior of the neocortical pyramidal cell was derived from a simple abstract neuron type, called a sigma-pi unit (Rumelhart et al. 1986). The sigma-pi unit computes its response as a sum of contributions from a set of multiplicative clusters of inputs, and is in this sense a neural instantiation of a polynomial function. Three factors have recommended this single-unit abstraction as the seed for a biological model. First, direct monosynaptic connections between units with multiplicative input stages can encode a very general class of nonlinear associations (Feldman and Ballard 1982; Giles and Maxwell 1987; Durbin and Rumelhart 1989; Mel 1990; Mel and Koch 1990; Poggio and Girosi 1990). Second, the learning of such associations can be achieved with a simple Hebbian rule of a general type known to exist in the central nervous system (Bliss and Lømo 1973; Brown et al. 1988; Sejnowski and Tesauro 1989). Third, numerous suggestions have been made over the past decade that voltage-dependent membrane mechanisms in dendritic trees could underlie the multiplicative interactions among neighboring synapses needed for this type of model neuron (Feldman and Ballard 1982; Koch et al. 1983; Shepherd et al. 1985; Koch and Poggio 1987; Rall and Segev 1987; Shepherd and Brayton 1987; Brown et al. 1988; Shepherd et al. 1989; Durbin and Rumelhart 1989; Mel 1990; Mel and Koch 1990; Poggio and Girosi 1990; Brown et al. 1991).
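For concreteness, a sigma-pi unit can be sketched in a few lines (our own illustration, not code from the paper):

    import numpy as np

    def sigma_pi(x, clusters, weights):
        # Response = sum over clusters j of w_j * product of the inputs in
        # cluster j; a neural instantiation of a polynomial in x.
        return sum(w * np.prod(x[idx]) for w, idx in zip(weights, clusters))

    # Example: y = w0*x0*x1 + w1*x2*x3*x4, with
    # clusters = [np.array([0, 1]), np.array([2, 3, 4])] and weights = [w0, w1]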
2 The Biophysical Model
We sought evidence for sigma-pi-like behavior in modeling studies of a serially reconstructed layer 5 pyramidal cell from cat visual cortex (Douglas et al. 1991). The compartmental modeling program, called NEURON (Hines 1989), mapped the lengths, diameters, and interconnectivity pattern of the dendritic branches into an equivalent electrical circuit that could be stimulated and probed at will. The dendritic tree shown in Figure 1 was
Figure 1: Testing for cluster sensitivity. Default membrane parameters were R_m = 21,000 Ω-cm², C_m = 1.0 μF/cm², and R_i = 306 Ω-cm, taken from a study of a layer 5 pyramidal cell in rat visual cortex (Stratford et al. 1990). (A-F) Zero-NMDA condition. In A, 80 spines were selected at random, designated by black circles. Each spine contained a synapse with a 0.8 nS AMPA conductance. In this and all other simulation runs, each synapse was stimulated asynchronously at 100 Hz for 100 msec, with a randomized interspike interval. Synaptic voltage traces are shown in B, and voltage at the cell body is shown in C, where four spikes were counted. The experiment was repeated in (D-F) with 80 synapses in clusters of size 6 (plus one partial cluster of size 2). All synapses within a cluster normally fell within one 20 μm dendritic compartment. Subsynaptic voltage traces in E are larger on average than those in B, reflecting the fact that the postsynaptic depolarization is indeed boosted where synapses are closely grouped. However, each AMPA synapse injects less current postsynaptically in its strongly depolarized state. The response at the soma (F), a measure of total synaptic current, is thus attenuated.
Figure 1: Continued. (G-L) High-NMDA condition. Experiments were repeated with NMDA peak conductance g_N = 0.9 nS and g_A = 0.1 nS. A preference is now seen for clusters of size 6 (J-L) vs. clusters of size 1 (G-I), since the NMDA channels become more effective as they become more strongly depolarized.
thus converted into 903 coupled electrical compartments, one for each 20 μm length of dendrite. Membrane for unmodeled dendritic spines was incorporated into the dendritic branches as in Stratford et al. (1990). All default simulation parameters are collected in Table 1. In each simulation run, a set of locations (usually 80, but up to 200) was selected from across the dendritic tree, marking sites to be synaptically stimulated. At each location, an electrical compartment representing a cylindrical spine (1.0 × 0.15 μm) was attached to the associated dendritic compartment. The distal end of each spine contained an
excitatory synapse with both a fast non-NMDA, or AMPA, conductance, and a slow, voltage-dependent NMDA conductance (Mayer and Westbrook 1987). The voltage-independent AMPA conductance was modeled as an alpha function (see Stratford et al. 1990):

    G_A(t) = g_A K t e^{-t/\tau}, \qquad K = e/\tau

where the peak conductance g_A varied between 0.1 and 1 nS, and the time to peak was \tau = 1 msec. The NMDA conductance depended exponentially on membrane voltage V and time, modeled as follows (Jahr and Stevens 1990a,b; Zador et al. 1990):

    G_N(V, t) = g_N \frac{e^{-t/\tau_1} - e^{-t/\tau_2}}{1 + \eta \, [\mathrm{Mg}^{2+}] \, e^{-\gamma V}}

with peak conductance g_N also in the range of 0.1 to 1.0 nS, \tau_1 = 40 msec, \tau_2 = 0.33 msec, \eta = 0.33/mM, [Mg2+] = 1 mM, and \gamma = 0.06/mV. The long NMDA time constant leads to the possibility of conductance summation across spikes in a high-frequency train; in most simulations, the NMDA
conductance was assumed to saturate to g_N, corresponding to the case of many presynaptic transmitter molecules and few receptor sites. In one case (Fig. 2D), NMDA conductance was allowed to summate across presynaptic spikes, and could thus achieve values much larger than g_N. In most experiments, all of the AMPA conductances were assigned the same peak value, as were all of the NMDA conductances, independent of location on the dendritic tree. The NMDA channel is normally discussed for its role in synaptic plasticity (see Collingridge and Bliss 1987; Brown et al. 1988).
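The two synaptic conductances are simple to evaluate; the sketch below (names and unit choices ours) follows the alpha-function and Jahr-Stevens forms given above, with V in mV and t in msec:

    import numpy as np

    def g_ampa(t, g_peak=0.8, tau=1.0):
        # Alpha-function AMPA conductance (nS); peaks at g_peak when t = tau.
        return g_peak * (np.e / tau) * t * np.exp(-t / tau)

    def g_nmda(V, t, g_peak=0.9, tau1=40.0, tau2=0.33,
               eta=0.33, mg=1.0, gamma=0.06):
        # NMDA conductance (nS): double exponential in time, gated by the
        # voltage-dependent Mg2+ block; eta in 1/mM, gamma in 1/mV.
        time_course = np.exp(-t / tau1) - np.exp(-t / tau2)
        mg_block = 1.0 + eta * mg * np.exp(-gamma * V)
        return g_peak * time_course / mg_block

    # Synaptic current is I = g * (E_rev - V) with E_rev = 0 mV. As V rises
    # from rest toward -30 mV the Mg2+ block is relieved faster than the
    # driving force falls, so NMDA current grows with depolarization while
    # AMPA current shrinks.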
Figure 2: (A) Ten randomly generated synaptic layouts were run for each cluster size between 1 and 9 with only AMPA channels; the summary plot shows the predicted monotonic decline in cell response as a function of increasing cluster size. (B) In the case of high dendritic NMDA, the response of the cell shows an initially increasing, nonmonotonic dependence on cluster size. Extensive exploration of the biophysical parameter space indicated that the observed NMDA-based cluster sensitivity phenomenon is extremely robust. Parameters varied in plots C-F are indicated in the upper left corner of each plot. The complete list of parameter variations (usually taken one or a few at a time) included membrane resistivity from 10,000 to 100,000 Ω-cm² (C), membrane capacitance from 0.9 to 2.0 μF/cm², and cytoplasmic resistivity from 100 to 362 Ω-cm (several configurations were taken from Stratford et al. 1990); synaptic conductance waveforms were changed from saturating to nonsaturating curves (D), spine membrane correction was used or not, and the count of activated synapses was varied from 80 to 200 (E). Synaptic strengths were normally uniform across the entire tree; in one run (F), peak synaptic conductances were set proportional to local dendritic input resistance, leading to a factor of 30 variation in conductance from the finest to the thickest dendritic branches. Spine neck resistance was also increased 6-fold in one run, from 173 MΩ to 1038 MΩ, which led to a significant reduction in, but did not abolish, the cluster sensitivity (i.e., there persisted a 50% increase in average cell response as clusters grew from size 1 to size 3). NMDA-based cluster sensitivity was also seen for both apical and basal dendrites individually, though the absolute difference between clustered and unclustered activation patterns as measured at a passive soma was significantly larger for basal (40%) than apical (18%) dendrites. Finally, in a pilot study on a smaller layer 3 pyramidal cell (Douglas et al. 1991), cluster sensitivity was seen at spike train frequencies of 20 and 100 Hz, or when only a single spike was delivered to each synapse.
Table 1: Default Simulation Parameters.

    R_m                          21,000 Ω-cm²
    C_m                          1.0 μF/cm²
    R_i                          306 Ω-cm
    Input resistance (soma)      51 MΩ
    Time constant (soma)         19 msec
    Resting potential            -70 mV
    Dendritic spines             L = 1.0 μm, D = 0.15 μm, neck resistance = 173 MΩ
    Synaptic stimulation         100 Hz with randomized interspike interval
    Active synapses              80, stimulated asynchronously
    AMPA conductance             g_A = 0.1-1.0 nS
    AMPA time constant           τ = 1 msec
    NMDA conductance             g_N = 0.1-1.0 nS
    NMDA time constants          τ_1 = 40 msec, τ_2 = 0.33 msec
    NMDA parameters              η = 0.33/mM, [Mg2+] = 1 mM, γ = 0.06/mV
    Synaptic reversal potential  0 mV (both AMPA and NMDA)
    Compartments                 903, plus 1 per activated dendritic spine
    Compartment length           20 μm
    Integration time step        0.1 msec
For present purposes, however, the crucial difference between AMPA and NMDA channels relates to their dependence on voltage. Briefly, the synaptic current injected through an AMPA channel declines steadily as the membrane in which it sits is depolarized from rest. The depolarizing effect of one AMPA synapse thus reduces the effectiveness of neighboring AMPA synapses, and vice versa. In contrast, NMDA current increases steadily as the membrane is depolarized from -70 to -30 mV (Mayer and Westbrook 1987), such that NMDA synapses can be more effective when activated in groups. Except for NMDA channels, the dendritic membrane was electrically passive. The soma contained Hodgkin-Huxley-style sodium and potassium channels in addition to a passive leak conductance. At the beginning of each run, the entire cell was set to a resting potential of -70 mV. A 100-Hz spike train was then delivered asynchronously to each of the preselected synapses, and the response at the soma was recorded over a 100-msec period. The primary measure of cell response was the number of spikes generated at the soma during the first 100 msec of synaptic stimulation, but passive-soma response measures were also used, with similar results, including peak somatic potential and the time integral of somatic potential. Our main hypothesis held that if sufficiently endowed with NMDA channels, the cell would respond selectively to patterns of stimulation in which the activated synapses were clustered, rather than uniformly distributed, across the dendritic tree. We first examined sensitivity to
cluster size in a dendritic tree with only AMPA channels (Fig. 1A-F). Sample runs are shown for clusters of size 1 (A, B, C) and 6 (D, E, F). As expected, clusters of 6 led to higher average subsynaptic depolarization (E vs. B), and hence less current injection per synapse. In consequence, the cell response was significantly reduced (F vs. C). This experiment was repeated with 90% of the total peak conductance assigned to NMDA channels.¹ In contrast to the pure AMPA case, a larger response was seen at the cell body when synapses were distributed in clusters of 6 (J, K, L), as compared to clusters of 1 (G, H, I). Note that "clustering" here refers to the spatial configuration of activated synapses, and not to an underlying inhomogeneity in the anatomical distribution of dendritic spines. To confirm these effects statistically, 10 random layouts were generated for each cluster size from 1 to 9, for both zero-NMDA and high-NMDA conditions. Figure 2A shows the predicted monotonic decline in cell response as a function of increasing cluster size when only AMPA channels were present. In contrast, Figure 2B shows for the high-NMDA case an initially increasing, nonmonotonic dependence on cluster size. The initial increase in cell response as clusters grow larger is explained by the average increase in membrane potential within each cluster, which leads to increased average effectiveness of the NMDA synapses. The subsequent fall-off of the response at larger cluster sizes is due to the diminishing driving potential as the membrane under each cluster saturates to the synaptic reversal potential. The specific cluster size at which the somatic response peaked, in this case 5, was not fixed, but typically varied over a range of 5 to 9 depending on the value of peak conductance assigned to the activated synapses. This basic effect of nonmonotonic cluster sensitivity was observed to hold under a very wide range of biophysical parameter manipulations (Fig. 2C-F). We conclude that a cortical pyramidal cell with a major proportion of its excitatory synaptic current carried by NMDA channels could respond preferentially to clustered distributions of activated synapses. In a pilot study on a smaller layer 2-3 pyramidal cell, a continuous increase in cluster sensitivity was seen as the proportion of peak NMDA conductance [i.e., g_N/(g_N + g_A)] was increased from 0 to 100%. In the current study, it was observed that a small proportion of AMPA conductance was useful in rapidly boosting the local membrane potential, and hence neighboring NMDA channels, into the voltage range where the latter begin to open significantly. The actual proportion of NMDA conductance in cortical pyramidal cells has not yet been determined, though its contribution to

¹Note that due to its dependence on voltage, the actual NMDA conductance achieved during synaptic stimulation will not normally approach the nominal maximum value g_N. For example, with a 90% proportion of NMDA peak conductance, that is, g_N/(g_N + g_A) = 0.9, the peak EPSC in response to a synchronous single-shock activation of 100 randomly placed excitatory synapses for the cell and parameters of Fig. 1 is only 30-40% due to current through NMDA channels. However, at least 85% of the total synaptic charge in this case is carried by the NMDA conductance, since its time constant is relatively much longer.
normal synaptic activation in these cells is known to be large (Miller et al. 1989; Fox et al. 1989; see also Keller et al. 1991 with respect to hippocampal granule cells).

¹Note that due to its dependence on voltage, the actual NMDA conductance achieved during synaptic stimulation will not normally approach the nominal maximum value g_N. For example, with a 90% proportion of NMDA peak conductance, that is, g_N/(g_N + g_A) = 0.9, the peak EPSC in response to a synchronous single-shock activation of 100 randomly placed excitatory synapses for the cell and parameters of Fig. 1 is only 30% due to current through NMDA channels. However, at least 85% of the total synaptic charge in this case is carried by the NMDA conductance, since its time constant is relatively much longer.

3 A Basis for Nonlinear Pattern Discrimination
Regarding the biological utility of dendritic cluster sensitivity, we observe that any mechanism that allows a cell to distinguish certain patterns of stimulation from others can act as a basis for memory. For example, if the subset of afferents activated in stimulus condition A terminates in a clustered configuration on the dendrite of our pyramidal cell, then pattern A will fire the cell with high probability, and will thus be "preferred." Another pattern, B, for which no such clustering condition holds, will with high probability fail to fire the cell. To establish a first empirical lower bound on the nonlinear pattern discrimination capacity of the single pyramidal cell, we sought to load a set of pattern "preferences" onto the dendritic tree in just this way. One hundred gray-level images of natural scenes were used as a convenient set of complex high-dimensional patterns.

Figure 3: Facing page. Gray-level images of natural scenes were used as a convenient set of complex high-dimensional training and test patterns. Each image was mapped onto a sparse binary pattern of activity over a set of orientation-selective visual filters. Four filters were centered at each of the 64 by 64 image pixel locations, sensitive to horizontal, vertical, and right- and left-oblique intensity edges, giving a total population of 16,384 orientation-selective visual "units." A global thresholding operation guaranteed that exactly 80 visual units were active in response to any presented image. Each visual unit gave rise to an afferent "axon" that could provide input to the dendritic tree. A training image was "loaded" by arranging that the 80 afferents activated by the image terminated in a clustered configuration on the dendritic tree that had previously been observed to fire the cell. After 50 images were loaded in this way, at most 4000 afferents made a synapse onto the dendrite (fewer in actuality, due to feature overlap among training images). All remaining uncommitted visual afferents (>12,384) were randomly mapped onto the dendrite as well, in order that any presented image should drive exactly 80 synapses in the dendritic tree. This resulted in a total of 16,384 synaptic "contacts" onto the dendritic surface, at a maximum density of one spine per micrometer of dendritic length. While this is a biologically unrealistic number of synapses for this single cell (5000 is more probable), this control measure guaranteed that any observed preference for trained vs. control images in the experiments of Figure 4 was the result of the pattern of synaptic clustering, and not a result of total synaptic activation. Note that the response properties of the orientation-selective visual units and the pattern of afferentation described here are not intended as models of cortical visual processing; any other type of (nonvisual) patterns and preprocessing stage might have been used.
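The encoding described in this caption can be sketched directly. The oriented filters below are simple difference operators chosen by us for illustration (the paper does not specify the filter profiles), but the 64 by 64 grid, four orientations, and exactly-80-active threshold follow the text.

```python
import numpy as np

# Illustrative sketch of the preprocessing stage: four assumed oriented
# edge detectors at each of 64x64 pixel locations, with a global
# threshold keeping exactly the 80 most active units.

def edge_responses(img):
    """img: (64, 64) gray-level array -> (4, 64, 64) oriented edge energies."""
    h = np.abs(np.roll(img, -1, axis=0) - img)               # horizontal edges
    v = np.abs(np.roll(img, -1, axis=1) - img)               # vertical edges
    d1 = np.abs(np.roll(np.roll(img, -1, 0), -1, 1) - img)   # right-oblique
    d2 = np.abs(np.roll(np.roll(img, -1, 0), 1, 1) - img)    # left-oblique
    return np.stack([h, v, d1, d2])

def encode(img, k=80):
    """Map an image to a sparse binary vector of length 16,384 with k ones."""
    r = edge_responses(img).ravel()          # 4 * 64 * 64 = 16,384 responses
    code = np.zeros(r.size, dtype=np.uint8)
    code[np.argpartition(r, -k)[-k:]] = 1    # global threshold: top-k units
    return code

img = np.random.rand(64, 64)                 # stand-in for a natural scene
print(encode(img).sum())                     # -> 80 active "visual units"
```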
A simple visual preprocessing step was used to map each image onto a sparse binary pattern of "neural" activity, consisting of 80 active lines within a bundle of 16,384 orientation-selective visual "axons" (Fig. 3). Another source of patterns or another type of preprocessing could have been used for these experiments; the visual modality was chosen only for convenience, and was not intended as a model of any aspect of cortical visual processing. Fifty images were chosen to act as a training set, 50 as a control set. For the first training image, a layout of 80 synapses in clusters of 8 was generated that produced at least one spike at the soma during 100 msec of 100-Hz stimulation (about 50% of clustered layouts of this kind generated at least one spike). The 80 orientation-selective visual afferents driven by the training image were then randomly mapped onto the 80 dendritic locations specified by the precomputed clustered layout (Fig. 3). The process was repeated for each training image in turn, except that a visual afferent, once mapped onto the dendritic tree, was never remapped in the
processing of a subsequent training image. When this loading procedure was complete, all remaining uncommitted visual afferents were randomly mapped to the dendrite as well, so that any image, trained or untrained, activated exactly 80 synapses of the same strength and with the same intensity. Five different image types were then presented to the cell, including intact training images, three types of corrupted training images, and control images. Average cell responses to each during 100 msec are shown in Figure 4. The most prominent feature of these results is the cell's overwhelming selectivity for the 50 trained vs. 50 control images. Only 1 in 100 control images presented to the cell in two complete runs generated a single spike. In contrast, 87% of the trained images generated at least one spike, with an average spike count of 1.25. Intermediate responses were seen for the three categories of corrupted training images. It is important to note that the procedure used for "storing" training patterns in the dendritic tree was designed to make the task of discriminating training vs. control patterns as difficult as possible for a linear neuron with only a global thresholding nonlinearity (e.g., a perceptron), since every presented pattern activated the same number of synaptic inputs, with the same synaptic strength. The much greater probability of firing an action potential in response to a trained (clustered) vs. control (diffuse) pattern is thus evidence that the dendritic tree is a more powerful pattern discriminator than a simple perceptron. Significantly reduced cell responses to linear superpositions of 2 training patterns are further evidence for this conclusion (Fig. 4). We may relate the standard measure of pattern discrimination capacity used for the simple perceptron to the present situation. Cover (1965) proved that an N-input perceptron has a capacity of 2N patterns, counting both training and control patterns (see Hertz et al. 1991). This result means that for 2N randomly chosen patterns, some randomly designated as training patterns, others as controls, the probability that a perceptron can perfectly discriminate all training from control patterns is 1/2. This simple result cannot be applied in the present situation, however, as we consider here instead the ability of the neuron to discriminate k training patterns (in this case 50) from all other possible patterns acting as controls. In this more demanding case, we do not require classification performance to be perfect, but rather that it fall within an acceptable error tolerance. For example, in the present nonoptimized experiment with 50 stored training images, the probability of misclassification was 14% (13% false negatives, 1% false positives).
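The counting behind Cover's result is easy to reproduce. The sketch below (ours, not the paper's) evaluates C(P, N) = 2 Σ_{k=0}^{N-1} binom(P-1, k), the number of linearly separable dichotomies of P points in general position in N dimensions; the separable fraction C(P, N)/2^P equals 1/2 at the capacity P = 2N.

```python
from math import comb

# Fraction of random dichotomies of P points in general position in
# N dimensions that are linearly separable (Cover 1965).
def separable_fraction(P, N):
    return 2 * sum(comb(P - 1, k) for k in range(N)) / 2**P

N = 100
print(separable_fraction(2 * N, N))   # -> 0.5 at the capacity P = 2N
print(separable_fraction(N, N))       # -> 1.0 well below capacity
print(separable_fraction(4 * N, N))   # -> ~0 well above capacity
```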
4 Conclusions

The single statistical parameter, cluster size, has proved to be a remarkably good predictor of responses to complex patterns of synaptic
[Figure 4: bar plot, "Decline in Cell Response as Training Patterns are Degraded"; normalized cell response (0.0 to 1.0) for training patterns, 50% and 20% corruption, and test patterns.]
Figure 4: Average cell responses to five types of presented images, combined over two runs with 50 training and control images interchanged. Largest responses were to whole training images and smallest to control images. In three intermediate cases, progressively attenuated responses were seen to (1) random 50/50 composites of features taken from two training images, (2) training images with 20%, and (3) 50% of their visual features randomly replaced. Pictorial representation of each category is only suggestive of these manipulations; corruptions to training images were actually carried out within the internal orientation-selective visual representation. The 50/50 composite case provides direct evidence that the dendritic tree is capable of nonlinear discrimination, since a thresholded linear neuron would respond equivalently to any such superposition of training patterns.
input in an NMDA-rich neuron. Further, a simple experiment has shown how the ability to respond selectively to statistically inhomogeneous distributions of activated synapses could act as a basis for nonlinear pattern discrimination within a dendritic tree. It must be noted, however, that other membrane mechanisms may exist in pyramidal dendrites in addition to NMDA channels that could mimic, enhance, alter, or abolish the cluster-based integrative behavior observed here. Nonetheless, the fact that NMDA channels can by themselves underlie a capacity for nonlinear pattern discrimination in dendritic trees is, in and of itself, useful knowledge. Beyond the widely accepted proposition that modifiable synaptic weights are the likely sites for neural information storage, these results highlight the probable importance of synaptic ordering in the generation of cortical cell responses as well. Given the rich combinatorics that are possible in choosing spatial arrangements of 10⁴ synapses in a dendritic tree, the true discrimination capacity of the single pyramidal cell is likely to be much greater than the level that has been empirically demonstrated here. In this vein, preliminary experiments have been conducted on an abstract cluster-sensitive neuron with 10,000 synapses in which the direct-loading procedure described above was replaced by a Hebb-type learning rule. Much larger discrimination capacities were easily achieved in this way for the abstract neuron (unpublished observations); application of a similar learning scheme within the present biophysical modeling context awaits further work (but see Brown et al. 1991). Elsewhere (Mel 1990; Mel and Koch 1990), it has been proposed that the three-dimensional geometry of intersecting axonal and dendritic arbors in the neocortical column is ideally suited to allow a given axon to establish synaptic contacts at many different dendritic loci as dictated by the constraints of a learning task. It must also be noted that the pyramidal cell is not alone in the cortex, but is a member of a group of cells within the cortical microcolumnar structure (Jones 1981; Peters and Kara 1987) within which a high degree of interaction is likely to occur. Within such a network of neurons, however, the issue of storage capacity is profound, and will need to be addressed in the future both analytically and through more sophisticated simulations.

Acknowledgments

We are greatly indebted to Rodney Douglas and Kevan Martin for providing us with anatomical data from two cortical pyramidal cells, to John Moore and Michael Hines for their compartmental modeling program, NEURON, and to Tom Tromey for creating a 3-D graphical interface to NEURON. Thanks to Öjvind Bernander, Christof Koch, Ken Miller, Ernst Niebur, and Stephanie Mel for useful discussions and many helpful comments on the manuscript. This work was supported by grants from
the National Institute of Mental Health, the Office of Naval Research, and the James S. McDonnell Foundation.
References

Bliss, T. V. P., and Lømo, T. 1973. Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol. 232, 331-356.
Brown, T. H., Chang, V. C., Ganong, A. H., Keenan, C. L., and Kelso, S. R. 1988. Biophysical properties of dendrites and spines that may control the induction and expression of long-term synaptic potentiation. In Long-Term Potentiation: From Biophysics to Behavior, pp. 201-264. Alan R. Liss, New York.
Brown, T. H., Mainen, Z. F., Zador, A. M., and Claiborne, B. J. 1991. Self-organization of Hebbian synapses in hippocampal neurons. In Advances in Neural Information Processing Systems, Vol. 3, R. Lippmann, J. Moody, and D. Touretzky, eds., pp. 39-45. Morgan Kaufmann, Palo Alto.
Collingridge, G. L., and Bliss, T. V. P. 1987. NMDA receptors: Their role in long-term potentiation. Trends Neurosci. 10, 288-293.
Cover, T. M. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Elect. Comput. 14, 326-334.
Douglas, R. J., Martin, K. A. C., and Whitteridge, D. 1991. An intracellular analysis of the visual responses of neurons in cat visual cortex. J. Physiol. 440, 659-696.
Durbin, R., and Rumelhart, D. E. 1989. Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Comp. 1, 133.
Feldman, J. A., and Ballard, D. H. 1982. Connectionist models and their properties. Cog. Sci. 6, 205-254.
Fox, K., Sato, H., and Daw, N. W. 1989. The location and function of NMDA receptors in cat and kitten visual cortex. J. Neurosci. 9, 2443-2454.
Giles, C. L., and Maxwell, T. 1987. Learning, invariance, and generalization in high-order neural networks. Appl. Opt. 26(23), 4972-4978.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Hines, M. 1989. A program for simulation of nerve equations with branching geometries. Int. J. Biomed. Comput. 24, 55-68.
Hofman, M. A. 1989. On the evolution and geometry of the brain in mammals. Prog. Neurobiol. 32, 137-158.
Jahr, C. E., and Stevens, C. F. 1990a. A quantitative description of NMDA receptor-channel kinetic behavior. J. Neurosci. 10, 1830-1837.
Jahr, C. E., and Stevens, C. F. 1990b. Voltage dependence of NMDA-activated macroscopic conductances predicted by single-channel kinetics. J. Neurosci. 10, 3176-3182.
Jones, E. G. 1981. Anatomy of cerebral cortex: Columnar input-output relations.
In The Organization of Cerebral Cortex, F. O. Schmitt, F. G. Worden, G. Adelman, and S. G. Dennis, eds. MIT Press, Cambridge, MA.
Keller, B. U., Konnerth, A., and Yaari, Y. 1991. Patch clamp analysis of excitatory synaptic currents in granule cells of rat hippocampus. J. Physiol. 435, 275-293.
Koch, C., and Poggio, T. 1987. Biophysics of computation: Neurons, synapses, and membranes. In Synaptic Function, G. M. Edelman, W. E. Gall, and W. M. Cowan, eds., pp. 637-698. Wiley, New York.
Koch, C., Poggio, T., and Torre, V. 1983. Nonlinear interactions in a dendritic tree: Localization, timing, and role in information processing. Proc. Natl. Acad. Sci. U.S.A. 80, 2799-2802.
Mayer, M. L., and Westbrook, G. L. 1987. The physiology of excitatory amino acids in the vertebrate central nervous system. Prog. Neurobiol. 28, 197-276.
Mel, B. W. 1990. The sigma-pi column: A model for associative learning in cerebral neocortex. CNS Memo #6, Computation and Neural Systems Program, California Institute of Technology.
Mel, B. W., and Koch, C. 1990. Sigma-pi learning: On radial basis functions and cortical associative learning. In Advances in Neural Information Processing Systems, Vol. 2, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Miller, K. D., Chapman, B., and Stryker, M. P. 1989. Visual responses in adult cat visual cortex depend on N-methyl-D-aspartate receptors. Proc. Natl. Acad. Sci. U.S.A. 86, 5183-5187.
Perkel, D. H., and Mulloney, B. 1978. Electrotonic properties of neurons: Steady-state compartmental model. J. Neurophysiol. 41, 621-639.
Peters, A., and Kara, D. A. 1987. The neuronal composition of area 17 of rat visual cortex. IV. The organization of pyramidal cells. J. Comp. Neurol. 260, 573-590.
Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978-982.
Rall, W. 1964. Theoretical significance of dendritic trees for neuronal input-output relations. In Neural Theory and Modeling, R. F. Reiss, ed., pp. 73-97. Stanford University Press, Stanford, CA.
Rall, W., and Segev, I. 1987. Functional possibilities for synapses on dendrites and on dendritic spines. In Synaptic Function, G. M. Edelman, W. E. Gall, and W. M. Cowan, eds., pp. 605-636. Wiley, New York.
Rumelhart, D. E., Hinton, G. E., and McClelland, J. L. 1986. A general framework for parallel distributed processing. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart and J. L. McClelland, eds., pp. 45-76. Bradford, Cambridge, MA.
Sejnowski, T. J., and Tesauro, G. 1989. The Hebb rule for synaptic plasticity: Algorithms and implementations. In Neural Models of Plasticity, J. H. Byrne and W. Berry, eds., pp. 94-103. Academic Press, Cambridge, MA.
Shepherd, G. M., and Brayton, R. K. 1987. Logic operations are properties of computer-simulated interactions between excitable dendritic spines. Neurosci. 21, 151-166.
Shepherd, G. M., Brayton, R. K., Miller, J. P., Segev, I., Rinzel, J., and Rall, W. 1985. Signal enhancement in distal cortical dendrites by means of
interactions between active dendritic spines. Proc. Natl. Acad. Sci. U.S.A. 82, 2192-2195.
Shepherd, G. M., Woolf, T. B., and Carnevale, N. T. 1989. Comparisons between active properties of distal dendritic branches and spines: Implications for neuronal computations. J. Cog. Neurosci. 1, 273-286.
Stratford, K., Mason, A., Larkman, A., Major, G., and Jack, J. 1990. The modeling of pyramidal neurons in visual cortex. In The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, eds., pp. 296-321. Addison-Wesley, Wokingham, England.
Traub, R. D. 1982. Simulation of intrinsic bursting in CA3 hippocampal neurons. Neurosci. 7, 1233-1242.
Zador, A., Koch, C., and Brown, T. H. 1990. Biophysical model of a Hebbian synapse. Proc. Natl. Acad. Sci. U.S.A. 87, 6718-6721.
Received 25 July 1991; accepted 26 November 1991.
Communicated by Wilfrid Rall
The Impact of Parallel Fiber Background Activity on the Cable Properties of Cerebellar Purkinje Cells

Moshe Rapp
Yosef Yarom
Idan Segev
Department of Neurobiology, Institute of Life Sciences, The Hebrew University, Jerusalem, 92904, Israel
Neurons in the mammalian CNS receive 10⁴-10⁵ synaptic inputs onto their dendritic tree. Each of these inputs may fire spontaneously at a rate of a few spikes per second. Consequently, the cell is bombarded by several hundred synaptic inputs in each and every millisecond. An extreme example is the cerebellar Purkinje cell (PC), receiving approximately 100,000 excitatory synapses from the parallel fibers (p.f.s) onto dendritic spines covering the thin dendritic branchlets. What is the effect of the p.f. activity on the integrative capabilities of the PC? This question is explored theoretically using analytical cable models as well as compartmental models of a morphologically and physiologically characterized PC from the guinea pig cerebellum. The input of individual p.f.s was modeled as a transient conductance change, peaking at 0.4 nS with a rise time of 0.3 msec and a reversal potential of +60 mV relative to rest. We found that already at a firing frequency of a few spikes per second the membrane conductance is several times larger than the membrane conductance in the absence of synaptic activity. As a result, the cable properties of the PC significantly change; the most sensitive parameters are the system time constant (τ₀) and the steady-state attenuation factor from dendritic terminal to soma. The implication is that the cable properties of central neurons in freely behaving animals are different from those measured in slice preparations or in anesthetized animals, where most of the synaptic inputs are inactive. We conclude that, because of the large conductance increase produced by the background activity of the p.f.s, the activity of the PC will be altered from this background level either when the p.f.s change their firing frequency for a period of several tens of milliseconds or when a large population of the p.f.s fires during a narrow time window.

Neural Computation 4, 518-533 (1992) © 1992 Massachusetts Institute of Technology
1 Introduction

The processing of synaptic information by nerve cells is a function of the morphology and the electrical properties of the membrane and cytoplasm, as well as the site and properties of the synaptic inputs that impinge onto the dendritic tree. The theoretical studies of W. Rall (1959, 1964, 1967, 1977) have shown that in a passive tree the relevant parameters that determine this processing are the input resistance of the cell (R_N), the average electrotonic length of the dendrites (L_av), the system time constant (τ₀), the rise time (t_peak) and magnitude of the synaptic conductance change, as well as the reversal potential (E_syn) of the synaptic current. Rall's studies, followed by the study of Jack and Redman (1971a), suggested experimental methods for estimating these parameters. Consequently, several neuron types were characterized based on these parameters (e.g., Jack and Redman 1971b; Barrett and Crill 1974; Brown et al. 1981; Stratford et al. 1989; Nitzan et al. 1990). These studies were performed either on anesthetized animals or on slice preparations, where most of the background synaptic activity is absent. However, neurons in the CNS are part of a large network and, therefore, each neuron receives thousands of synaptic inputs which may have ongoing activity of a few spikes per second. Such a massive input will generate a sustained conductance change which will modify the integrative capabilities of the neuron (Holmes and Woody 1989). The present study explores the effect of this conductance change on the integrative properties of the cell, using detailed cable and compartmental models of a morphologically and physiologically characterized Purkinje cell (PC) from the guinea pig cerebellum (Rapp 1990; Segev et al. 1991). This cell has a large and complex dendritic tree (Fig. 1) with several hundred spiny branchlets (Fig. 2) that are densely studded with spines. Each of these spines receives an excitatory (asymmetrical) synaptic input from the parallel fiber (p.f.) system, summing to a total of ~100,000 p.f.s impinging on a single PC. In such a large system, even a low spontaneous activity rate of each of the p.f.s will generate a high frequency of synaptic input to the PC. The consequences of such an input for the input/output properties of the PC are discussed.
2 Model
Simulations were performed using cable and compartmental models of the cell shown in Figure 1. The cable properties of the same cell were estimated from intrasomatic recordings performed in the presence of 2.5 mM Cs⁺ ions that abolish the marked subthreshold membrane rectification of these cells (Crapel and Penit-Soria 1986; Rapp 1990; Segev et al. 1991). The process of utilizing the model to estimate the specific membrane resistivity (R_m) and capacitance (C_m), and the specific
Figure 1: The modeled Purkinje cell. This cell was reconstructed following an intracellular injection of horseradish peroxidase. The cable parameters of the cell were also characterized in the presence of 2.5 mM Cs⁺ ions to abolish the marked rectification that exists near the resting potential. The input resistance (R_N) was 12.9 MΩ and the system time constant (τ₀) was 46 msec. Assuming a total spine number of 100,000, each with an area of 1 µm², the soma-dendritic surface area sums to 149,500 µm². The optimal matching between the morphology and the cable properties of the cell implies that R_m is 440 Ω·cm² at the soma and 110,000 Ω·cm² at the dendrites, C_m is 1.64 µF/cm², and R_i is 250 Ω·cm.

Figure 2: Facing page. Sholl diagram, in morphological units, of the cell shown in Figure 1. This cell was represented by 1500 cylindrical segments and a spherical soma. Red lines denote the spiny branchlets that are studded with approximately 10 spines per 1 µm of dendritic length. Each of these spines receives an excitatory input from a parallel fiber. These branchlets constitute 85% of the dendritic area (without the spines). The cell has 473 dendritic terminals, some of which terminate less than 100 µm away from the soma whereas others extend to a distance close to 400 µm from the soma. Because the parallel fibers impinge only on the spiny branchlets, these branchlets are electrically "stretched" when the parallel fibers are active.
resistivity of the cytoplasm (R_i), is elaborated in Segev et al. (1989) and Nitzan et al. (1990). For the PC under study the optimal match between the cell morphology and the electrical measurements was obtained with R_m = 440 Ω·cm² at the soma and 110,000 Ω·cm² at the dendrites, a uniform
C_m of 1.64 µF/cm², and R_i of 250 Ω·cm. These values yielded an input resistance, R_N, of 12.9 MΩ and a system time constant, τ₀, of 46 msec, both matching the corresponding experimental values. Morphological measurements of the cell in Figure 1 showed that the total length of dendritic branchlets bearing spines is 10,230 µm. These branchlets are marked by the red lines of Figure 2, which shows a Sholl diagram of the cell in Figure 1. The density of spines on the spiny branchlets has been reported to range between 10 and 14 per 1 µm of dendritic length (Harris and Stevens 1988a,b); we estimated the total number of spines in the modeled tree to be 100,000, each with an area of 1 µm² (Harris and Stevens 1988a,b). The total soma-dendritic surface area (including spines) of the modeled neuron was 149,500 µm². In most of the simulations spines were incorporated into the model globally, rather than by representing each spine as a separate segment, since the latter would be computationally very slow. It can be shown that, for plausible values of R_m, R_i, and C_m, when current flows from the dendrite into the spine, the membrane area of the spine can be incorporated into the membrane of the parent dendrite (Fig. 3A). The reason for this is that for this direction of current flow, the spine base and the spine head membranes are essentially isopotential (Jack et al. 1975; Rall and Rinzel 1973; Segev and Rall 1988). This approximation is valid when the cable parameters, such as R_N, τ₀, and the average cable length of the dendrites, L_av, are estimated from the model. It does not hold when the voltage response of the p.f. input impinging onto the spine head is of interest. In this case spines receiving synaptic input were represented in full (Fig. 3B). To incorporate the spines globally into the model we modified the method suggested by Holmes (1989). In this method the original dimensions of the parent dendrite are preserved and the membrane area of the spines is effectively incorporated into the dendritic membrane by changing the values of R_m and C_m. When both spines and parent dendrite have the same specific membrane properties (R_m and C_m), the transformed values are:
R_m' = R_m / F   and   C_m' = C_m · F,   with   F = (area_dend + area_spines) / area_dend   (2.1)

where area_dend is the area of the parent dendrite without the spines and area_spines is the membrane area of the spines at that dendritic segment. Note that this transformation preserves the time constant of the original spiny dendrite.
[Figure 3 panels: (A) When measuring cable properties (τ₀, R_N, R_T, AF), the membrane area of both passive and synaptically activated spines is incorporated into the parent dendrite membrane. (B) When measuring synaptic potential, spines receiving synaptic input remain unincorporated.]
Figure 3: Schematic representation of the method used to globally incorporate dendritic spines into the membrane of the parent dendrite. (A) The case where the current flows from the dendrite into the spines (arrows in left schematic). (B) The case where current flows from the spines into the dendrite. In (A), equation 2.2 is first used to calculate an effective R_m value for the whole segment (middle schematic); then equation 2.1 is utilized for calculating new specific values for the membrane of the parent branch to effectively incorporate the spine area into the parent dendrite area. In (B), spines receiving synaptic input remain unincorporated, whereas passive spines are incorporated into the parent dendrite membrane as in (A).
The present study focuses on the effect of the conductance change induced by the p.f. input on the cable properties of the PC. In this case the effective membrane resistivity at the spine heads receiving the input (R_m,actspines) is reduced as compared to the membrane resistivity of the nonactivated spines and of the parent segment. Assuming that the input can be approximated by a steady-state conductance change (see below), the effective time constant of the spiny segment (dendrite plus spines) lies between the time constant of the dendritic membrane (R_m · C_m) and that of the activated spine membrane (R_m,actspines · C_m). Now the first step in utilizing equation 2.1 is to find a single R_m* value for the whole dendritic segment (parent dendrite plus all spines) such that (R_m* · C_m) approximates the effective time constant of the original (nonhomogeneous) segment. Our calculations show that for an electrically short spiny segment, R_m* is the reciprocal of the sum of the specific
conductances of the two membrane types, weighted by their relative area:

R_m* = 1 / [ (1/R_m,rest)(area_rest/area_total) + (1/R_m,actspines)(area_actspines/area_total) ]   (2.2)
where area_rest is the membrane area of the parent dendrite and of the nonactivated spines, area_actspines is the area of the activated spines, and area_total = (area_rest + area_actspines) (see also Fleshman et al. 1988). Having a single R_m* value as calculated in equation 2.2 for the whole segment (Fig. 3A, middle panel), equation 2.1 can then be used (with R_m* replacing R_m) for incorporating the spine area into the parent dendrite area (Fig. 3A, right panel). Transient (compartmental) models were studied using SPICE (Segev et al. 1985, 1989). These models were utilized to simulate the voltage response to the parallel fiber input. To facilitate the computations, only a representative group of 200 randomly selected spines was modeled individually. In each run a different group of 200 such spines was selected. The rest of the spines were globally incorporated into the parent branchlet membrane as discussed above. Each of the representative spines was simulated by an R-C circuit, modeling the passive membrane properties of the spine, and an additional synaptic path consisting of a transient conductance change, g_syn(t), in series with a synaptic battery, E_syn. This compartment was connected to the parent dendrite by an axial resistance, representing the longitudinal resistivity of the spine neck (Segev and Rall 1988). The representative spines were activated synchronously at a high frequency, ω, such that ω = N·θ/200, where N is the number of p.f.s and θ is the original (low) firing frequency of the p.f.s. For example, if each of the N = 100,000 p.f.s is activated at θ = 2 Hz (200,000 inputs/sec), then each of the 200 representative spines is activated once every msec (ω = 1000 Hz), thus preserving the total number of inputs per unit time. The validity of this approximation was examined by increasing the number of representative spines to 400 and decreasing ω by a factor of 2. These changes resulted in only minor differences in both the cable properties and the membrane voltage produced at the modeled cell. Also, the input resistance, R_N, as calculated using SPICE matched the analytical results of the cable models (see below). These control tests have led us to conclude that for the range of input frequencies tested, because of the large number of inputs impinging on the tree, the results of the present study are essentially independent of the exact timing of the input and/or the location of the activated spines, provided that the total conductance change per unit time is preserved. Intuitively, a dendritic tree which is bombarded by a massive transient synaptic input must experience approximately a steady conductance change (Rall 1962). This effective conductance change, g_steady, can then be utilized in steady-state (segmental) cable models to compute analytically the impact of the synaptic activity on the cable properties of the cell.
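The folding procedure of equations 2.1 and 2.2 amounts to a few lines of arithmetic; the sketch below (variable names are our own; the example areas are inferred from the figures quoted later in this section, 144,538 + 200 − 100,000 ≈ 44,738 µm² of branchlet membrane) makes it concrete.

```python
# Sketch of equations 2.1-2.2 (our variable names). Rm in ohm*cm^2,
# Cm in uF/cm^2; areas may be in any common unit since only ratios enter.

def folded_Rm(Rm_rest, Rm_actspines, area_rest, area_actspines):
    """Equation 2.2: area-weighted parallel combination of the specific
    conductances of resting and synaptically activated membrane."""
    area_total = area_rest + area_actspines
    g = (1.0 / Rm_rest) * (area_rest / area_total) \
        + (1.0 / Rm_actspines) * (area_actspines / area_total)
    return 1.0 / g

def incorporate_spines(Rm, Cm, area_dend, area_spines):
    """Equation 2.1: fold the spine membrane into the parent dendrite by
    rescaling Rm and Cm with F = (area_dend + area_spines) / area_dend.
    Since Rm' * Cm' = Rm * Cm, the membrane time constant is preserved."""
    F = (area_dend + area_spines) / area_dend
    return Rm / F, Cm * F

# Example with purely passive spines on the spiny branchlets
# (branchlet area is an estimate derived from the text):
Rm_p, Cm_p = incorporate_spines(110_000.0, 1.64, area_dend=44_738.0,
                                area_spines=100_000.0)
print(round(Rm_p), round(Cm_p, 2))   # lower Rm', higher Cm', same Rm*Cm
```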
For each transient synaptic input, g_syn(t), activated at a frequency θ, the effective steady conductance change is

g_steady = θ ∫₀^∞ g_syn(t) dt   (2.3)
When 200 representative spines were used, ω replaced θ in equation 2.3. In that case, each of these spines has an effective R_m value,

R_m,actspines = area_spine / (g_rest + g_steady)   (2.4)

where area_spine is the membrane area of the spine and g_rest = (area_spine / R_m,rest) is the resting conductance of the spine membrane (without the synaptic input). This R_m,actspines value, over an area area_actspines = 200 µm² (200 spines), was utilized in equation 2.2 to calculate the effective R_m* value of the spiny branchlets, with area_total being the total membrane area of all spiny branchlets plus spines. In this way the total synaptic conductance was equally distributed over the whole membrane surface of the spiny branchlets. In our computations, g_syn(t) was modeled with an "alpha function,"

g_syn(t) = g_max (t/t_peak) exp{1 - (t/t_peak)}   (2.5)
where g_max is the peak synaptic conductance change and t_peak is the rise time. For the alpha function the integral in equation 2.3 is

∫₀^∞ g_syn(t) dt = g_max · t_peak · exp(1)   (2.6)
With the values used in the present study (g_max = 0.4 nS; t_peak = 0.3 msec), this integral is 0.33 nS·msec. For θ = 2 Hz (ω = 1000 Hz) we get from equation 2.3 that g_steady = 0.33 nS. Thus, with a spine area of 1 µm² and R_m,rest of 110,000 Ω·cm², equation 2.4 implies that the effective membrane resistivity of each of the activated spines is as small as 30 Ω·cm². Utilizing equation 2.2 with this value (with area_rest = 144,538 µm² and area_actspines = 200 µm²), the effective R_m* at the spiny branchlets was 18,140 Ω·cm² (rather than 110,000 Ω·cm²), suggesting that the activation of the p.f.s has a marked effect on the cable properties of the PC. Having R_m and R_i values for the tree, the algorithm developed by Rall (1959) was implemented to analytically calculate the effect of p.f. activation on the soma input resistance (R_N), the average input resistance at the terminals (R_T), the average cable length of the dendrites (L_av), and the average voltage attenuation factor from the dendritic tips into the soma (AF_T→S); see also Nitzan et al. (1990). The modifications needed to include a nonuniform R_m distribution are given in Fleshman et al. (1988). The system time constant (τ₀) was computed by "peeling" the voltage response to a brief current pulse produced by the corresponding compartmental model (Rall 1969).
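The chain of numbers in this paragraph can be checked in a few lines; the sketch below reproduces them from equations 2.2-2.6 (the small discrepancy in the last value comes from rounding R_m,actspines to 30 Ω·cm² in the text).

```python
from math import e

# Numerical check of equations 2.2-2.6 with the values just quoted.
g_max, t_peak = 0.4, 0.3            # nS, msec
integral = g_max * t_peak * e       # eq. 2.6: ~0.33 nS*msec
omega = 1000.0                      # Hz; 200 spines stand in for 100,000 at 2 Hz
g_steady = omega * integral * 1e-3  # eq. 2.3 (msec -> sec): ~0.33 nS

# Equation 2.4: effective Rm of an activated 1-um^2 spine (ohm*cm^2).
area = 1e-8                         # 1 um^2 in cm^2
g_rest = area / 110_000.0           # resting spine-membrane conductance (S)
Rm_act = area / (g_rest + g_steady * 1e-9)
print(round(g_steady, 2), round(Rm_act))   # -> 0.33 (nS), ~31 (~30 as quoted)

# Equation 2.2 then gives the effective resistivity of the spiny branchlets.
area_rest, area_act = 144_538.0, 200.0     # um^2
w_rest = area_rest / (area_rest + area_act)
w_act = area_act / (area_rest + area_act)
Rm_star = 1.0 / ((1 / 110_000.0) * w_rest + (1 / Rm_act) * w_act)
print(round(Rm_star))   # -> ~18,460; the quoted 18,140 rounds Rm_act to 30
```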
The values for the unitary synaptic conductance change are based on the patch-clamp studies of Hirano and Hagiwara (1988) and Llano et al. (1991). The value for g_max was estimated from the minimal peak of the current produced by the p.f. input, whereas the value for t_peak was estimated from the experimental rise time (measured at 22°C) corrected to 37°C, assuming a Q₁₀ of 3.

3 Results
The voltage responses of the modeled PC to the p.f. input are shown in Figure 4 for an input frequency (θ) of 0.5 Hz (Fig. 4A) and 5 Hz (Fig. 4B). In each of these panels the continuous lines depict the voltage response at an arbitrarily chosen spine head (top trace), at the base of this spine (middle trace), and at the soma (lower trace). The bottom panels (C and D) show the corresponding input into each of the 200 representative spines (ω = 250 Hz in Fig. 4C and 2500 Hz in Fig. 4D). Dashed lines in A and B are the voltage responses at the corresponding sites when a steady-state conductance change (g_steady), as defined in equation 2.3, is imposed on each of the 200 representative spines. This steady input, shown by the dashed line in the corresponding lower panels, is 0.08 nS in Figure 4C and 0.8 nS in Figure 4D (in the latter case the dashed line is masked by the heavy line). Unlike the case of relatively low input frequency (Fig. 4C), where each transient input is seen individually, at high frequency (Fig. 4D) the temporal summation of individual inputs resulted in the heavy line. The figure demonstrates that already at a low firing rate of 0.5 Hz (Fig. 4A) the p.f. input produces a peak depolarization of approximately 15 mV at the spine head, 12 mV at the spine base, and 8 mV at the soma. At 5 Hz (Fig. 4B), the maximal depolarization at the spine head is 45 mV, 42 mV at the spine base, and 30 mV at the soma. An important difference in the voltage produced by the two input frequencies is in the rate of the voltage buildup. At a low frequency (Fig. 4A), the depolarization reaches a steady-state value after more than 100 msec, whereas at high frequency (Fig. 4B) the steady-state value is reached after approximately 50 msec (see Discussion). Figure 4 also shows that the steady-state approximation (dashed lines) faithfully reproduces the voltage response of the cell already at the low firing rate of 0.5 Hz (Fig. 4A). The agreement between the results of the transient input and the steady-state input implies that, indeed, the depolarization along the tree is essentially a function of ω · g_max · t_peak (equations 2.3 and 2.5). Thus, ω, g_max, and t_peak are interchangeable. Another point to note is that, unlike the case of a localized synaptic input where a significant voltage attenuation is expected, when a distributed input is activated the depolarization along the soma-dendritic surface is rather uniform (compare the voltage at the spine head and at the soma).
[Figure 4: panels A and B show membrane voltage (mV) vs. time (msec); panels C and D show the conductance input (nS) per representative spine vs. time (msec).]
Figure 4: Depolarization produced by background activity of the parallel fibers. (A,C) θ = 0.5 Hz; (B,D) θ = 5 Hz. Bottom frames show the conductance input into each of the 200 representative spines modeled in full. Each of these spines receives transient conductance input at a frequency ω = θ · (100,000/200) = 500·θ; individual inputs are clearly seen at the low firing rate (C). At the higher frequency (D) the transient synaptic conductance changes are summed up to produce the heavy line. Dashed line in (C) and (D) [masked by the heavy line in (D)] shows the corresponding steady-state conductance change as calculated from equation 2.3. Upper panels (A and B) show the resultant depolarization at the head of one of the activated spines (upper curve), at its base (middle curve), and at the soma (lower curve). Dashed lines show the results when a steady-state conductance change is used. Note the excellent agreement between the results using transient inputs and the results obtained with a steady-state conductance change.
How do the different cable properties of the cell and the soma depolarization depend on the frequency of the p.f. input? Figure 5A shows
[Figure 5: panels A-D, each plotted against parallel fiber input frequency (Hz).]
Figure 5: The cable properties of the PC critically depend on the input frequency (θ) of the parallel fibers. The graphs show this dependence on θ for (A) the depolarization produced at the PC soma, (B) the average electrotonic length of the PC dendrites, (C) the soma input resistance, and (D) the system time constant. The points in (A) were calculated using SPICE; the points in (B-D) were calculated analytically. These changes will influence the efficacy of the p.f. input.

that the depolarization at the soma changes steeply at low frequencies
and tends to saturate at higher frequencies. The saturated value of somatic depolarization induced by the p.f. input was found to be (2/3)·E_syn, namely 40 mV for the parameters chosen in the present study. The effect of the p.f. input frequency on the average cable length (L_av) of the dendritic tree is shown in Figure 5B. Due to the high R_m value at rest (at θ = 0 Hz), L_av is 0.13. The tree is electrically "stretched" to L_av = 0.31 (a factor of 2.4) at 5 Hz. At this range of frequencies the input resistance at the soma (Fig. 5C) is decreased from 12.9 MΩ at
0 Hz to 6.5 MΩ at 5 Hz (50%), whereas the average input resistance at the terminal tips (R_T) is reduced by only 20% (from 104 to 83 MΩ; not shown). The system time constant (Fig. 5D) is the most sensitive parameter; it is reduced from 46 to 12.1 msec (a factor of 3.8; see also Rall 1962). This implies that, at 5 Hz, the effective membrane conductance is four times larger than the membrane conductance at 0 Hz. Another parameter that was calculated (not shown) is the average steady-state attenuation factor (AF_T→S) from the terminal tips to the soma. This parameter is increased from 8.8 (at 0 Hz) to 28 (at 5 Hz), a factor of 3.2. For the explicit relation between R_T and AF_T→S, see Rall and Rinzel (1973). We conclude that the background activity of the p.f.s significantly changes the cable properties of the PC dendrites, with the following relative sensitivity: τ₀ > AF_T→S > L_av > R_N > R_T. The somatic depolarization resulting from p.f. activation rises steeply as a function of the input frequency. Already at the relatively low firing frequency of 5 Hz it reaches 3/4 of the maximal depolarization that the p.f.s can produce at the PC's soma (Fig. 5A). This depolarization develops rather smoothly, with a rate that increases as the input frequency increases.
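To see why these parameters are so sensitive, the sketch below sweeps the input frequency θ through the effective-resistivity expressions of Section 2, with all 100,000 spines treated as activated; the branchlet membrane area (~44,738 µm²) is inferred from the area figures quoted there, so the output is indicative only.

```python
from math import e

# Effective specific resistivity of the spiny branchlets as a function of
# parallel-fiber frequency theta, chaining equations 2.2-2.4.
def branchlet_Rm(theta, g_max=0.4e-9, t_peak=0.3e-3, Rm_rest=110_000.0,
                 spine_area=1e-8, area_dend=44_738.0, area_spines=100_000.0):
    g_steady = theta * g_max * t_peak * e                    # S per spine
    Rm_act = spine_area / (spine_area / Rm_rest + g_steady)  # eq. 2.4
    area_total = area_dend + area_spines
    g = (1 / Rm_rest) * (area_dend / area_total) \
        + (1 / Rm_act) * (area_spines / area_total)          # eq. 2.2
    return 1.0 / g                                           # ohm*cm^2

for theta in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(theta, round(branchlet_Rm(theta)))
# The branchlet resistivity falls severalfold over 0-5 Hz, consistent with
# the reported changes in tau_0, R_N, and L_av.
```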
4 Discussion
The present study demonstrates that the background activity of the parallel fibers has a dramatic effect on the functional properties of cerebellar Purkinje cells. Already at a low firing rate of a few Hz, the membrane conductance of the PC significantly increases. As a result, both the system time constant, τ₀, and the input resistance, R_N, decrease severalfold, whereas the electrotonic length, L_av, and the voltage attenuation factor, AF_T→S (not shown), increase (Fig. 5B-D). This background activity is expected to significantly depolarize the PC (Fig. 4A and B, and Fig. 5A). The effect of the background activity on the cable properties of the cell strongly depends on the time integral of the synaptic conductance change and on the frequency of the background activity (equation 2.3). The same general results hold also for other neurons from the mammalian CNS receiving a large number of synaptic contacts, each of which may be activated spontaneously at a frequency of a few spikes/sec. Indeed, similar conclusions have been recently reached by Bernander et al. (1991), who modeled the effect of background activity on the cable properties of a reconstructed layer V pyramidal cell in the visual cortex. We therefore suggest that the effective cable properties and the "resting potential" of these neurons in the behaving animal are different from those measured in the slice preparation or in anesthetized animals (Holmes and Woody 1989; Abbott 1991; Amit and Tsodyks 1992). The results of the present study have several interesting implications for the integrative capabilities of central neurons. The massive background activity (and the corresponding increase in the membrane
conductance) implies that single p.f. inputs essentially lose their functional meaning and only the activation of a large number of p.f.s will significantly displace the membrane potential. It should be noted, however, that other, more efficient individual inputs (which may also contact a different dendritic region and do not participate in the background activity) can have a marked effect on the input/output characteristics of the PC. An example is the powerful climbing fiber input, which forms as many as 200 synaptic contacts on the soma and main dendrite of the Purkinje cell. When activated, the whole dendritic tree and soma are strongly depolarized; this produces a short burst of high-frequency firing at the PC axon (Llinás and Walton 1990). As demonstrated in Figure 5A, the soma depolarization (the excitatory synaptic current reaching the soma) is a nonlinear function of the input firing rate. The higher the frequency of the background activity, the larger the number of excitatory synapses that need to be recruited to depolarize the cell by a given amount. This is in contrast to the presently used neural network models, where the current input into a modeled "cell" is assumed to be linearly related to the input firing rate. The saturation of the soma depolarization at relatively low firing rates (Fig. 5A) implies a narrow dynamic range for the detection of changes in p.f. input frequency. Figure 4A and B show that for any given input frequency, several tens of milliseconds are required for the voltage to reach a steady-state value. Therefore, changes in the frequency of the p.f. input will be "detected" only if the change lasts for a relatively long period. Furthermore, because of the change in τ₀, the time course of the voltage change is a function of the input frequency. The higher the input frequency, the faster the buildup of the potential toward its steady value. Hence, at higher background frequencies, more synapses are required to shift the membrane potential by a given amount, but the time course of this shift is faster. For example, suppose that the frequency of the background activity of the p.f.s is 1 Hz (100 synapses/msec); the resulting depolarization (relative to 0 Hz) is 13 mV (Fig. 4A). Increasing the frequency to 2 Hz (an additional 100 synapses/msec) will further depolarize the soma by 8 mV, provided that the frequency change lasts at least 50 msec (about 2.5 times the τ₀ corresponding to 2 Hz; Fig. 4D). These dynamic alterations in the depolarization of the PC soma will modulate the cell firing rate. Inhibitory inputs onto the PC originate primarily from the stellate cells (dendritic inhibition) and from the basket cells (mainly somatic inhibition). Since our model has a very leaky soma (low somatic R_m), the basket cell input is essentially built into the model. The number of stellate cells that contact a single PC is much smaller than the number of p.f.s. It is expected, therefore, that the background activity onto the PC will be dominated by the activity of the p.f.s. Our preliminary simulations suggest that the inhibition induced by the stellate cells can act only locally,
at a given dendritic region, and have only a minor effect on the somatic membrane potential produced by the p.f. activity. Finally, it has been clearly shown that the PC's dendrites are endowed with a variety of voltage-sensitive channels (Llinás and Sugimori 1980a,b). In response to synaptic inputs, these channels produce both subthreshold nonlinearities as well as full-blown dendritic spikes. The effect of these nonlinearities on the results of the present report will be explored in a future study.
Acknowledgments

This work was supported by a grant from the Office of Naval Research and a grant from the Israeli Academy of Sciences. We thank our colleague Shaul Hochstein for critical comments on this work.
References

Abbott, L. F. 1991. Realistic synaptic input for model neural networks. Network (in press).
Amit, D. J., and Tsodyks, M. V. 1992. Effective neurons and attractor neural networks in cortical environment. Network (in press).
Barrett, J. N., and Crill, W. E. 1974. Specific membrane properties of cat motoneurones. J. Physiol. (London) 239, 301-324.
Bernander, O., Douglas, R. J., Martin, K. A. C., and Koch, C. 1991. Synaptic background activity determines spatio-temporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. U.S.A. 88, 11569-11573.
Brown, T. H., Fricke, R. A., and Perkel, D. H. 1981. Passive electrical constants in three classes of hippocampal neurons. J. Neurophysiol. 46, 812-827.
Crapel, F., and Penit-Soria, J. 1986. Inward rectification and low threshold calcium conductance in rat cerebellar Purkinje cells. J. Physiol. (London) 372, 1-23.
Fleshman, J. W., Segev, I., and Burke, R. E. 1988. Electrotonic architecture of type-identified α-motoneurons in the cat spinal cord. J. Neurophysiol. 60, 60-85.
Harris, K. M., and Stevens, J. K. 1988a. Dendritic spines of rat cerebellar Purkinje cells: Serial electron microscopy with reference to their biophysical characteristics. J. Neurosci. 8, 4455-4469.
Harris, K. M., and Stevens, J. K. 1988b. Study of dendritic spines by serial electron microscopy and three-dimensional reconstruction. Neurol. Neurobiol. 37, 179-199.
Hirano, T., and Hagiwara, S. 1988. Synaptic transmission between rat cerebellar granule and Purkinje cells in dissociated cell culture: Effects of excitatory-amino acid transmitter antagonists. Proc. Natl. Acad. Sci. U.S.A. 85, 934-938.
Holmes, W. R. 1989. The role of dendritic diameter in maximizing the effectiveness of synaptic inputs. Brain Res. 478, 127-137.
Holmes, W. R., and Woody, C. D. 1989. Effect of uniform and non-uniform synaptic "activation-distribution" on the cable properties of modeled cortical pyramidal cells. Brain Res. 505, 12-22.
Jack, J. J. B., Noble, D., and Tsien, R. W. 1975. Electric Current Flow in Excitable Cells. Oxford University Press, Oxford.
Jack, J. J. B., and Redman, S. J. 1971a. The propagation of transient potentials in some linear cable structures. J. Physiol. (London) 215, 283-320.
Jack, J. J. B., and Redman, S. J. 1971b. An electrical description of the motoneuron, and its application to the analysis of synaptic potentials. J. Physiol. (London) 215, 321-352.
Llano, I., Marty, A., Armstrong, C. M., and Konnerth, A. 1991. Synaptic and agonist-induced excitatory current of Purkinje cells in rat cerebellar slices. J. Physiol. (London) 431, 183-213.
Llinás, R. R., and Sugimori, M. 1980a. Electrophysiological properties of in vitro Purkinje cell somata in mammalian cerebellar slices. J. Physiol. (London) 305, 171-195.
Llinás, R. R., and Sugimori, M. 1980b. Electrophysiological properties of in vitro Purkinje cell dendrites in mammalian cerebellar slices. J. Physiol. (London) 305, 197-213.
Llinás, R. R., and Walton, K. D. 1990. Cerebellum. In The Synaptic Organization of the Brain, G. M. Shepherd, ed., pp. 214-245. Oxford University Press, Oxford.
Nitzan, R., Segev, I., and Yarom, Y. 1990. Voltage behavior along irregular dendritic structure of morphologically and physiologically characterized vagal motoneurones in the guinea pig. J. Neurophysiol. 63, 333-346.
Rall, W. 1959. Branching dendritic trees and motoneuron membrane resistivity. Exp. Neurol. 1, 491-527.
Rall, W. 1962. Theory of physiological properties of dendrites. Ann. NY Acad. Sci. 96, 1071-1092.
Rall, W. 1964. Theoretical significance of dendritic trees for neuronal input-output relations. In Neural Theory and Modeling, R. Reiss, ed., pp. 73-97. Stanford University Press, Stanford.
Rall, W. 1967. Distinguishing theoretical synaptic potentials computed for different soma-dendritic distributions of synaptic input. J. Neurophysiol. 30, 1138-1168.
Rall, W. 1969. Time constants and electrotonic length of membrane cylinders in neurons. Biophys. J. 9, 1483-1508.
Rall, W. 1977. Core conductor theory and cable properties of neurons. In Handbook of Physiology, Vol. 1, Pt. 1, The Nervous System, E. R. Kandel, ed., pp. 39-97. American Physiological Society, Bethesda, MD.
Rall, W., and Rinzel, J. 1973. Branch input resistance and steady attenuation for input to one branch of a dendritic neuron model. Biophys. J. 13, 648-688.
Rapp, M. 1990. The passive cable properties and the effect of dendritic spines on the integrative properties of Purkinje cells. M.Sc. Thesis, The Hebrew University, Jerusalem.
Segev, I., and Rall, W. 1988. Computational study of an excitable dendritic spine. J. Neurophysiol. 60, 499-523.
Segev, I., Fleshman, J. W., Miller, J. P., and Bunow, B. 1985. Modeling the electrical properties of anatomically complex neurons using a network analysis program: Passive membrane. Biol. Cybern. 53, 27-40.
Segev, I., Fleshman, J. W., and Burke, R. E. 1989. Compartmental models of complex neurons. In Methods in Neuronal Modeling: From Synapses to Networks, C. Koch and I. Segev, eds., pp. 171-194. Bradford Books, Cambridge, MA.
Segev, I., Rapp, M., Manor, Y., and Yarom, Y. 1992. Analog and digital processing in single nerve cells: Dendritic integration and axonal propagation. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds. Academic Press, Orlando, FL (in press).
Stratford, K., Mason, A., Larkman, A., Major, G., and Jack, J. 1989. The modelling of pyramidal neurons in the visual cortex. In The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, eds., pp. 296-321. Addison-Wesley, Wokingham, England.
Received 23 July 1991; accepted 12 December 1991.
Communicated by Larry Abbott
Activity Patterns of a Slow Synapse Network Predicted by Explicitly Averaging Spike Dynamics

John Rinzel
Mathematical Research Branch, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892 USA
Paul Frankel
Division of Applied Mathematics, Brown University, Providence, RI 02912 USA, and Mathematical Research Branch, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892 USA
When postsynaptic conductance varies slowly compared to the spike generation process, a straightforward averaging scheme can be used to reduce the system's complexity. Our model consists of a Hodgkin-Huxley-like membrane description for each cell; synaptic activation is described by first-order kinetics, with slow rates, in which the equilibrium activation is a sigmoidal function of the presynaptic voltage. Our work concentrates on a two-cell network, and it applies qualitatively to the activity patterns, including bistable behavior, recently observed in simple in vitro circuits with slow synapses (Kleinfeld et al. 1990). The fact that our averaged system is derived from a realistic biophysical model has important consequences. In particular, it can preserve certain hysteresis behavior near threshold that is not represented in a simple ad hoc sigmoidal input-output network. This behavior enables a coupled pair of cells, one excitatory and one inhibitory, to generate an alternating burst rhythm even though neither cell has fatiguing properties. 1 Introduction
When modeling the dynamic activity of cell ensembles, descriptions are often used that do not account for action potential generation (e.g., Hopfield and Tank 1986). Cell activity might be represented by instantaneous spike frequency, or as mean (time-averaged) membrane potential. Although one assumes that a justifiable procedure for averaging over the spikes can be carried out explicitly, activity equations, based on sigmoidal
input-output relations, are usually presented ad hoc without direct connection to the biophysical properties of neurons. Here we consider a situation in which averaging can be performed systematically starting from the bottom up. In various neurons postsynaptic responses are much slower than spikes (Adams et al. 1986; Kleinfeld et al. 1990; Syed et al. 1990). We exploit these differing time scales in order to average explicitly over the action potential dynamics and thereby derive a simplified model that involves only the slowly varying synaptic activation variables. We apply our model and use phase plane methods to describe nonlinear behavior seen in recent experiments with two-cell in vitro invertebrate networks (Kleinfeld et al. 1990). For example, neurons that are mutually inhibitory can exhibit bistable behavior in which one cell is firing and the other not, or vice versa (Section 3). Switching between these two modes can be induced with current stimuli that are not too brief. Our model cells, like the real neurons (Kleinfeld et al. 1990), exhibit sustained repetitive firing but not endogenous bursting. Yet we find, surprisingly, that an inhibitory cell and an excitatory cell may interact via slow synapses in order to generate an oscillatory burst pattern (Section 4). In our case, this rhythm occurs because an isolated cell's input-output (I/O) relation can exhibit hysteresis: for a certain range of input intensity, the cell membrane can either fire repetitively or be stationary, as can the standard Hodgkin-Huxley (HH) model (Rinzel 1978). The alternating burst pattern does not depend on a fatiguing or postinhibitory rebound mechanism (Perkel and Mulloney 1974), nor on synaptic self-feedback (Wilson and Cowan 1972). It would not be predicted with an ad hoc model employing sigmoidal I/O relations. 2 Averaging an HH Model with Slow Synapses
We consider a two-cell system (each cell may synapse on the other but not on itself) whose model equations are:
dv_i/dt = -I_ion(v_i, n_i) - g_ji s_ji (v_i - v_syn) + I_i    (2.1)

dn_i/dt = [n_∞(v_i) - n_i]/τ_n(v_i)    (2.2)

ds_ij/dt = λ_ij [s_∞(v_i) - s_ij]/τ_s,    i, j = 1, 2 (i ≠ j)    (2.3)
The first two equations (2.1-2.2) for each cell constitute an HH-like model for spike behavior; I_ion(v, n) = g_Na m_∞³(v)(v - v_Na) + g_K n⁴(v - v_K) + g_L(v - v_L) contains the voltage-activated inward (say, sodium) and outward (potassium) currents and the leakage current. Sodium activation, m, is assumed so fast that it achieves steady state instantaneously: m = m_∞(v) = 1/{1 + exp[(θ_m - v)/k_m]}. For potassium activation, n, the voltage-dependent "steady-state" and "time constant" functions are given, respectively, by n_∞(v) = 1/{1 + exp[(θ_n - v)/k_n]} and τ_n(v) = τ̄_n/cosh[(θ_n - v)/(2k_n)]. The postsynaptic activation, s_ij, in cell j specifies the fraction
of maximal conductance, g_ij; it obeys first-order kinetics in which the steady-state level, s_∞(v) = 1/{1 + exp[2(θ_syn - v)/k_syn]}, depends on the presynaptic voltage of cell i. The equations and variables are dimensionless. Voltages have been divided by the sodium reversal potential, v_Na. An effective membrane time constant, τ_m, is defined by using a typical conductance, G_m. Time is scaled relative to τ_m, and all conductances are relative to G_m. The applied current I_i is dimensionless after dividing by G_m v_Na. Parameter values are listed in the legend to Figure 1. Our model of synaptic activation dynamics is highly idealized, and it does not distinguish whether the rate-limiting step is in the postsynaptic channel kinetics, in transmitter binding or in handling processes in the cleft, or in the release mechanism. Nevertheless, results obtained with the simplified description agree qualitatively with exploratory simulations using more detailed models. Here, by "slow" synapses we mean that τ_s is large relative to 1 and τ̄_n. Thus, since s_ij changes slowly compared to v and n, we proceed as follows. First, we determine the I/O properties of each spike-generating subsystem with s_ji treated as a parameter. Then, in equation 2.3, we replace s_∞(v_i) by its average, S̄_∞, over the spike activity of cell i (which depends on s_ji). This average is defined by

S̄_∞(s_ji) = (1/T) ∫₀ᵀ s_∞(v_i(t; s_ji)) dt    (2.4)
where v_i(t; s_ji) is the oscillating time course for a periodic solution (period T) to equations 2.1-2.2 with s_ji held constant. For a time-independent, steady-state solution of equations 2.1-2.2, v_i,ss = v_i,ss(s_ji), the value of S̄_∞(s_ji) equals s_∞(v_i,ss), which satisfies equation 2.4 trivially. With the above replacement (equation 2.4) we obtain the reduced two-variable system for the averaged synaptic activations, s̄_ij, which approximate the exact quantities s_ij:

ds̄_ij/dt = λ_ij [S̄_∞(s̄_ji) - s̄_ij]/τ_s,    i, j = 1, 2 (i ≠ j)    (2.5)
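To make the averaging step concrete, here is a minimal numerical sketch in Python (the authors used PHASEPLANE and AUTO; everything below is our stand-in, not their code). It freezes the synaptic drive s, integrates the fast subsystem (equations 2.1-2.2) by forward Euler, and approximates S̄_∞(s) of equation 2.4 by a long-time average over many spike cycles rather than exactly one period. The cosh form assumed for τ_n and the step size are our choices; parameter values follow the Figure 1 legend.

```python
import numpy as np

# Dimensionless parameters from the Figure 1 legend.
gNa, vNa, th_m, k_m = 4.0, 1.0, 0.3, 0.12
gK, vK, th_n, k_n, tau_n_bar = 4.0, -0.1, 0.1, 0.1, 4.5
gL, vL = 0.0333, 0.1
th_syn, k_syn = 0.43, 0.12

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

def m_inf(v): return sig((v - th_m) / k_m)            # instantaneous Na activation
def n_inf(v): return sig((v - th_n) / k_n)            # K activation steady state
def s_inf(v): return sig(2.0 * (v - th_syn) / k_syn)  # synaptic steady state
def tau_n(v): return tau_n_bar / np.cosh((th_n - v) / (2.0 * k_n))  # assumed form

def I_ion(v, n):
    return (gNa * m_inf(v) ** 3 * (v - vNa)
            + gK * n ** 4 * (v - vK)
            + gL * (v - vL))

def S_bar(s, g=5.55556, v_syn=-0.1, I=0.667, dt=0.01, T=2000.0):
    """Approximate S_bar_inf(s) of equation 2.4: the average of s_inf(v(t))
    over the attractor of equations 2.1-2.2 with the drive s held constant."""
    v, n = 0.0, n_inf(0.0)
    nsteps = int(T / dt)
    burn = nsteps // 2                 # discard the transient
    acc = 0.0
    for k in range(nsteps):
        dv = -I_ion(v, n) - g * s * (v - v_syn) + I
        dn = (n_inf(v) - n) / tau_n(v)
        v, n = v + dt * dv, n + dt * dn
        if k >= burn:
            acc += s_inf(v)
    return acc / (nsteps - burn)

# Sweeping s traces one branch of the lower panel of Figure 1; near the
# knees the result depends on the initial condition, which is the hysteresis.
for s in (0.0, 0.05, 0.10, 0.115):
    print(f"s = {s:.3f}   S_bar = {S_bar(s):.3f}")
```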
Only recently have averaging procedures been applied to biomathematical models in which the slow variables feed back to influence the fast dynamics; a rigorous basis is just beginning to emerge (Dvořák and Šiška 1989; Pernarowski 1990; Rinzel and Lee 1986). To illustrate the effect of slow synaptic dynamics, we consider a case of unidirectional inhibitory coupling (Fig. 1): g_21 = 0, v_syn = v_K. When the silent cell #1 is stimulated, s_12 increases slowly and eventually inhibits the follower cell #2. After the stimulus to cell #1 is turned off, cell #2 resumes firing, with some delay, when s_12 decreases sufficiently. Interpretation of the on-off transitions is based on the I/O relation for cell #2 (see legend). In this relation, S̄_∞(s_12) measures the steady response of #2 for the input s_12 treated as a parameter (lower panel of Fig. 1). Numerical integration of equations 2.1-2.3 was carried out using the software package PHASEPLANE (Ermentrout 1990); the Adams-Bashforth method was used with Δt = 0.1.
[Figure 1 appears here: upper panels, time courses of v_1, v_2, and s_12 versus time; lowest panel, integrated output versus s_12 (inhibitory).]
Figure 1: Suppression of autorhythmicity by "slow" synaptic inhibition. (Three upper panels) Cell #2 fires repetitively when uncoupled due to steady bias current, I_2 = 0.667. When cell #1 is stimulated (I_1 = 0.66667, 180 ≤ t ≤ 270), inhibition, s_12, to #2 slowly increases; #2 turns off when s_12 exceeds θ_ω (= 0.123). After stimulus I_1 is removed, inhibition slowly wears off and #2 resumes firing when s_12 decreases below θ_HB (≈ 0.106). Time courses obtained by numerical integration of equations 2.1-2.3. (Lowest panel) Input/output relation of #2 with s_12 treated as parameter. Ordinate is integrated output (equation 2.4), computed with AUTO (Doedel 1981). osc denotes steady repetitive firing state; ss denotes time-independent stationary state. Dotted lines indicate unstable state. Coexistence of osc and ss for θ_HB < s_12 < θ_ω leads to hysteresis for on-off transitions of cell #2 in upper panels. All variables and parameters are dimensionless (see text). Values given here are used throughout except as noted: g_Na = 4.0, v_Na = 1.0, θ_m = 0.3, k_m = 0.12, g_K = 4.0, v_K = -0.1, θ_n = 0.1, k_n = 0.1, τ̄_n = 4.5, g_L = 0.0333, v_L = 0.1, g_12 = 5.55556, v_syn = -0.1, θ_syn = 0.43, k_syn = 0.12, τ_s = 150, λ_12 = 1.0.
To evaluate the functions S̄_∞(s_ij) we used AUTO (Doedel 1981), which computes periodic and steady-state solutions of nonlinear differential equations as a function of a parameter.
For all dynamic simulations, each I_i includes a small additive stochastic component to diminish delay effects that can occur when a slowly modulated stationary state loses stability (Rinzel and Baer 1988). This noise is uniformly distributed in the interval (-0.005, 0.005), and is sampled at each function call. Although this best mimics machine noise or roundoff error, a similar decrease in the ramp effect was obtained by modeling "white" noise with a standard deviation of 0.001. Without noise the system behaves similarly, but our phase plane analysis becomes harder to explain.

3 Mutual Inhibition, Bistability, and the "Synaptic" Phase Plane
Two mutually inhibitory neurons (v_syn = v_K for each cell) may exhibit a unique stationary activity pattern if the steady bias currents applied to the cells are in an appropriate range. If only one cell is intensely stimulated, it will fire at a high rate and suppress the other cell. If both receive strong stimulation they will fire steadily but at a rate reduced from their uncoupled rates by inhibition. Less intuitive is an intermediate case of nonuniqueness when the system is bistable: either cell is firing and the other silent. We apply phase plane concepts (Edelstein-Keshet 1988) to interpret this latter case for our averaged model (equation 2.5). The nullcline for s̄_ij is the curve along which ds̄_ij/dt = 0, and from equation 2.5 this curve is just the graph of the averaged I/O relation. Thus, in Figure 2, the s̄_21 nullcline is obtained by replotting the lower panel of Figure 1; for s̄_12 we interchange roles (s̄_21 becomes the independent variable along the ordinate) and again replot. These nullclines in the "synaptic" phase plane have a restricted interpretation. In particular, the portions corresponding to unstable oscillations of equations 2.1 and 2.2 (open circles and dotted lines on osc branches) are not directly meaningful, since averaging is not justified in such cases. The sign of ds̄_ij/dt is not determined by the values of s̄_12 and s̄_21 alone when they are in regions where the nullcline is multibranched; one needs to know whether cell i is oscillating or nearly stationary. Similarly, to identify positions along a trajectory where a cell starts or stops firing one must apply the correct state-dependent threshold values (θ_ω and θ_HB). If approximate "synaptic" trajectories are sought by integrating equation 2.5 (without equations 2.1 and 2.2), it is necessary to keep track of which branch is currently applicable so that branch switching can be carried out properly. An intersection of these nullclines does represent a steady state of equation 2.5, which can be stable only if the associated states (stationary or periodic) of equations 2.1 and 2.2 are stable. In Figure 2 there are two such stable steady states (open squares). These patterns correspond, in the full model, to cell #1 firing and #2 silent, or vice versa.
[Figure 2 appears here: upper panel, the s̄_21 versus s̄_12 "synaptic" phase plane; lower panels, time courses of v_1, v_2, s̄_12, and s̄_21.]
Figure 2: Bistability for two cells coupled by slow inhibition. Phase plane for synaptic activation variables (upper panel). Nullclines for the approximating "averaged" equation 2.5 are graphs of the cells' I/O relations (from equation 2.4, computed numerically with AUTO); the curve for s̄_ij corresponds to activity of cell j driven by the slow synapse from cell i. Each nullcline has two branches, for either stationary or periodic behavior (labeled ss or osc, respectively) of the spike-generating subsystem equations 2.1-2.2. Stability of these behaviors is indicated as follows: cell 1 stable states, solid lines; cell 1 unstable states, dotted lines; cell 2 stable states, filled circles; cell 2 unstable states, open circles. Of the several nullcline intersections, two represent stable states (denoted by open squares): one cell on (s̄_ij ≈ 0.4) and the other off (s̄_ji ≈ 0), or vice versa. Heavy curve corresponds to the "synaptic" trajectory for a simulation of the switching experiment; time courses shown in lower panels. Initially, #2 is on and then, following a square current step to #1 (duration, 180; intensity, 1.5), #2 is off and #1 is on. Return to the initial state occurs after an identical current step to #2. Currents I_i are so strong that during stimulation, the cell is depolarized beyond firing level. Phase plane trajectory is shown dashed during stimulus application. Parameters particular to this case: I_1 = I_2 = 0.66667, g_21 = 5.55556.
Similar bistability was found experimentally (compare Kleinfeld et al. 1990, Figs. 8-9, to our Fig. 2) and demonstrated by using square current pulses to switch dominance in the pair from one cell to the other. In our analogous simulation, which starts with cell #2 active, we see that current to #1 must be delivered long enough not only for #2 to stop firing (when s_12 rises above θ_ω), but for the "synaptic" trajectory to cross the separatrix (the 45° diagonal, in this case of identical cells) between the two attracting states. Observe that there is a delay between the termination of the current and when #1 starts firing. It is clear from this graphic representation that firing commences only after inhibition, s̄_21, from #2 drops below θ_HB. This delay phenomenon illustrates the controlling role played by the slow synaptic variables, and the importance of considering their transient behavior for proper interpretation of firing patterns. We offer a few additional observations regarding the "synaptic" phase plane. As one expects, the trajectory crosses the s̄_12 nullcline vertically (lower right), and similarly, during the reswitching phase, the s̄_21 nullcline is crossed horizontally. Also, the dependence of activity patterns on the steady bias currents can be related to changes in the nullclines. For example, a strong depolarizing bias to cell #2 would shift the s̄_21 nullcline rightward (with some change in shape) and thereby preclude bistability. The only stable state would be with #2 firing and #1 off. A classical treatment of mutual inhibition might assume a cell's I/O relation (equation 2.4) is monotonic. In that case one could also predict bistability for the pair over a range of bias currents (Kleinfeld et al. 1990). While our model cells exhibit hysteresis in their I/O relations, this does not significantly influence the behavior of our mutually inhibitory pair over the parameter ranges we have explored. However, if one of the cells is excitatory, hysteresis can lead to a qualitatively different pattern, in which the cells can establish a slow alternating oscillatory rhythm. We explore this next.
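The switching experiment of Figure 2 can be explored with a direct Euler integration of the full system, equations 2.1-2.3. The sketch below reuses the membrane helpers (I_ion, n_inf, tau_n, s_inf) from the sketch in Section 2; the pulse timing, step size, and initial state are illustrative choices of ours, not the published protocol.

```python
import numpy as np

rng = np.random.default_rng(1)
g = 5.55556                           # g_12 = g_21 (Figure 2)
I_bias = 0.66667                      # I_1 = I_2
lam, tau_s, v_syn = 1.0, 150.0, -0.1  # slow inhibitory synapses (v_syn = v_K-like)

def rhs(y, I1, I2):
    v1, n1, v2, n2, s12, s21 = y
    return np.array([
        -I_ion(v1, n1) - g * s21 * (v1 - v_syn) + I1,   # eq. 2.1, cell 1
        (n_inf(v1) - n1) / tau_n(v1),                   # eq. 2.2, cell 1
        -I_ion(v2, n2) - g * s12 * (v2 - v_syn) + I2,   # eq. 2.1, cell 2
        (n_inf(v2) - n2) / tau_n(v2),                   # eq. 2.2, cell 2
        lam * (s_inf(v1) - s12) / tau_s,                # eq. 2.3
        lam * (s_inf(v2) - s21) / tau_s,
    ])

dt = 0.05
y = np.array([0.0, 0.2, 0.3, 0.4, 0.0, 0.3])     # cell 2 initially dominant
for step in range(int(1500.0 / dt)):
    t = step * dt
    pulse = 1.5 if 200.0 <= t < 380.0 else 0.0   # square current step to cell 1
    eta = rng.uniform(-0.005, 0.005, size=2)     # small additive noise, as in the text
    y = y + dt * rhs(y, I_bias + pulse + eta[0], I_bias + eta[1])
# Per Figure 2, a sufficiently long pulse should leave s12 high and s21 low,
# i.e., cell 1 firing and cell 2 suppressed.
```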
4 A Two-Cell Burster

The averaged I/O relation for a cell that receives steady or slow excitatory synaptic input is illustrated in Figure 3 (upper). The hysteretic effect near threshold is quite substantial for these simplified HH spike-generating dynamics. If the target cell (#2), when brought to firing, could slowly inhibit the sending cell, then s_12 might decrease enough (below θ_ω) so that #2 would turn off. With #2 silent, the output, s_12, of #1 could recover to its prior level and thereby reinitiate firing in #2. As this cycle repeats we would observe rhythmic bursting. Figure 4 illustrates such an oscillation for two model cells: one excitatory and the other inhibitory.
[Figure 3 appears here: upper panel, integrated output versus s_12 (excitatory); lower panel, potassium activation n versus s_12.]
Figure 3: Input/output relation for a cell that receives steady or slow synaptic excitation. (Upper panel) Integrated activity (equation 2.4) of cell #2 as a function of excitation from cell #1; s_12 treated as a parameter in equations 2.1-2.2, which are solved using AUTO. Labels as in Figure 1 (lower); dotted lines denote unstable state. Repetitive firing and stationary behavior coexist over the range θ_ω < s_12 < θ_HB (θ_ω = 0.132, θ_HB ≈ 0.317). For large s_12, repetitive firing is precluded; the response is stable steady depolarization. (Lower panel) Potassium activation n versus s_12 for the output states shown in the upper panel. For the osc branch, maximum and minimum values of n during a cycle of repetitive firing are shown. Note, in the coexistence range, n_min for the stable and unstable osc branches are nearly equal and close to ss, indicating that the v-n trajectories for these periodic states are extremely close during the slow recovery phase of the action potential near the rest state. Parameters particular to this case: g_12 = 2.0, I_2 = 0.
Figure 4: Bursting oscillation of an excitatory-inhibitory cell pair with slow synapses. Phase plane for synaptic activation variables (upper panel). Nullcline for s̄_12 is defined and obtained as in Figure 2; similarly for s̄_21, with the I/O relation for excitation from Figure 3. There is no intersection of nullclines here that corresponds to a stable steady state of averaged equation 2.5. Rather, these equations are assumed to have a stable periodic solution (not shown) that approximates the bursting oscillation (closed curve with arrowheads) of the full model, equations 2.1-2.3. Since s_12 is slower than s_21 (λ_12 = 0.333, λ_21 = 1.0), the synaptic trajectory basically traverses the hysteresis loop of the s̄_21 nullcline. Time courses (lower panel) show the alternating activity pattern of the burst rhythm. On-off transitions of spiking are labeled on the time courses and in the phase plane. Neither cell is firing during phase a-b; both are firing during c-d. Ragged appearance of v during spiking is due to a plotting artifact. Parameters particular to this case: I_1 = 0.66667, I_2 = 0.0.
The bias currents and synaptic weights are set so that the activity threshold for the excitatory cell lies between the on and off activity levels for the inhibitory cell, that is, the s̄_12 nullcline crosses midway through the overlapping osc and ss branches of the s̄_21 nullcline (upper panel). Thus, when #2 is active and s_21 is high, s_12 decreases, but when #2 is inactive and s_21 is low, then s_12 increases from low levels. The activity of #1 lags somewhat behind that of #2 because we have set λ_12 to be 1/3 of λ_21. Thus, in the time courses, we see that s_12 rises
and falls more slowly than s_21. If s_12 were even slower, the closed orbit in the phase plane would have nearly vertical jumps between phases during which it would ride the upper or lower branch of the hysteresis loop. Notice that cell #2 stops firing early, that is, the trajectory drops vertically before it reaches the left knee. This feature can be explained by considering the v-n phase plane of the action potential dynamics. For s_12 in the overlap regime (Fig. 3 upper), equations 2.1 and 2.2 have a stable steady state surrounded by the stable limit cycle of repetitive firing. These two attractors are separated by an unstable periodic orbit (dotted osc). Between the depolarizing phases of successive spikes, the variable n falls to a minimum as the v-n trajectory passes near the steady state (Fig. 3 lower). Moreover, the stable cycle is extremely close to the unstable cycle during this phase. Hence, small fluctuations from noise (perhaps analogous to channel noise or action currents attenuated through gap junctions) and/or the dynamic variation of s_12 can perturb the trajectory across the unstable cycle into the domain of attraction of the steady state and thereby can prematurely terminate the repetitive firing mode. The bursting behavior of Figure 4 depends on hysteresis in the I/O relation for at least one, but not necessarily both, of the cells. Nonmonotonicity for the cell receiving inhibition is not required. The reverse situation (with s_12 faster and with hysteresis in its nullcline) could also lead to bursting. We note that if both I/O relations were monotonic, as in the classical treatment, bursting would not be possible. Of course, bursting in our example would be precluded if a strong depolarizing bias were applied to #1. In this case, the s̄_12 nullcline would approximately translate upward and both cells would fire continuously. The robustness of bursting is determined by how each parameter shifts the nullclines. As long as the nullclines intersect in the same generic manner, bursting is obtained. For example, the bias current I_1 given to cell #1 permits a large range of possible values for bursting (0.6, 1.1). On the other hand, the bias current I_2 has a much smaller permissible range (-0.03, 0.02). 5 Discussion
For two model neurons with slow synaptic coupling we have obtained a reduced model by averaging over the HH-like spike-generating dynamics. Phase plane methods were applied to the reduced synaptic activation equations to predict and interpret activity patterns for the full model. We illustrated our approach by considering examples of nonlinear behavior: bistability for mutually inhibitory cells and bursting for an excitatory-inhibitory pair. A model of mutually excitatory cells can also exhibit multistable behavior, for example, with the two states of on-on and off-off. It yields to essentially the same analysis as the inhibitory pair model and therefore is not discussed in this work.
The illustrated mechanism for rhythmic bursting does not involve slowly varying intrinsic membrane currents, such as a slow voltage-dependent conductance or a calcium-modulated current. Here, an isolated cell does not burst. Nor does the mechanism reflect nonlinearities associated with the cells' connectivity, for example, autocatalytic effects of synaptic self-excitation (Wilson and Cowan 1972). The primary factor is hysteresis in the near-threshold behavior for repetitive firing of the spike-generating dynamics. This feature should not be viewed as extraordinary. There are a few generic ways that oscillations can arise as parameters are varied (Rinzel and Ermentrout 1989). The Hopf bifurcation to periodic behavior is associated with a nonzero minimum frequency, and the onset may involve bistability with a coexistent stable oscillation and a stable steady state over an interval of parameter values adjacent to the threshold value. This is the case for our v-n system, as well as for the standard HH model (Rinzel 1978). For a different generic onset behavior (referred to as Type I in Rinzel and Ermentrout 1989) the spike frequency rises smoothly from zero. Such a spike generator in our setup, considered by us and by G. B. Ermentrout (private communication), leads to monotonic averaged I/O relations and hence cannot cause bursting. Repetitive firing in the cultured Aplysia neurons (Kleinfeld et al. 1990) exhibits a sudden onset with nonzero frequency, as in the Hopf bifurcation mechanism for our model cells. However, hysteresis as in Figures 1 and 3 was not reported for these experiments. Thus it is uncertain whether these neurons would support bursting oscillations of the sort we have found. Nevertheless, it is intriguing that hysteresis in a simple HH-like membrane with simple model synapses that have no fatiguing properties can lead to network-mediated oscillations.
References

Adams, P. R., Jones, S. W., Pennefather, P., Brown, D. A., Koch, C., and Lancaster, C. 1986. Slow synaptic transmission in frog sympathetic ganglia. J. Exp. Biol. 124, 259-285.
Doedel, E. J. 1981. AUTO: A program for the automatic bifurcation analysis of autonomous systems. Congr. Numer. 30, 265-284.
Dvořák, I., and Šiška, J. 1989. Analysis of metabolic systems with complex slow and fast dynamics. Bull. Math. Biol. 51(2), 255-274.
Edelstein-Keshet, L. 1988. Mathematical Models in Biology. Random House, New York.
Ermentrout, G. B. 1990. PHASEPLANE: The Dynamical Systems Tool, Version 3.0. Brooks/Cole Publishing Co., Pacific Grove, CA.
Hopfield, J. J., and Tank, D. W. 1986. Computing with neural circuits: A model. Science 233, 625-633.
Kleinfeld, D., Raccuia-Behling, F., and Chiel, H. J. 1990. Circuits constructed from identified Aplysia neurons exhibit multiple patterns of persistent activity. Biophys. J. 57, 697-715.
Perkel, D. H., and Mulloney, B. 1974. Motor pattern production in reciprocally inhibitory neurons exhibiting postinhibitory rebound. Science 185, 181-183.
Pernarowski, M. 1990. The mathematical analysis of bursting electrical activity in pancreatic beta cells. Ph.D. Thesis, University of Washington.
Rinzel, J. 1978. On repetitive activity in nerve. Fed. Proc. 37(14), 2793-2802.
Rinzel, J., and Lee, Y. S. 1986. On different mechanisms for membrane potential bursting. In Nonlinear Oscillations in Biology and Chemistry, H. G. Othmer, ed. Lecture Notes in Biomathematics, Vol. 66, pp. 19-33. Springer, New York.
Rinzel, J., and Baer, S. M. 1988. Threshold for repetitive activity for a slow stimulus ramp: A memory effect and its dependence on fluctuations. Biophys. J. 54, 551-555.
Rinzel, J., and Ermentrout, G. B. 1989. Analysis of neural excitability and oscillations. In Methods in Neuronal Modeling: From Synapses to Networks, C. Koch and I. Segev, eds., pp. 135-171. MIT Press, Cambridge, MA.
Syed, N. I., Bulloch, G. M., and Lukowiak, K. 1990. In vitro reconstruction of the respiratory central pattern generator of the mollusk Lymnaea. Science 250, 282-285.
Wilson, H. R., and Cowan, J. D. 1972. Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J. 12, 1-24.
Received 8 July 1991; accepted 16 September 1991.
Communicated by Nancy Kopell
Phase Coupling in Simulated Chains of Coupled Oscillators Representing the Lamprey Spinal Cord

Thelma L. Williams
Department of Physiology, St. George's Hospital Medical School, Tooting, London SW17 0RE, United Kingdom
Previous application of a mathematical theory of chains of coupled oscillators to the results of experiments on the lamprey spinal cord led to conclusions about the mechanisms of intersegmental coordination in the lamprey. The theory provides no direct link, however, to electrophysiological data obtained at the cellular level, nor are the details of the neuronal circuitry in the lamprey known. In this paper, a variant of the theory is developed for which the relevant variables can potentially be measured. This theory will be applied to measurements on simulated oscillators, based on a network that has been postulated to constitute the basic circuitry of the segmental oscillator in the lamprey. A linear approximation to the equations is derived, and it will be shown that the behavior of simulated chains of these oscillators obeys the predictions of this approximation. 1 Introduction
Many neural systems produce periodic activity, and may thus be regarded as oscillators or as systems of oscillators. Kopell and Ermentrout have developed a mathematical theory of chains of coupled oscillators, in which the effect of the coupling between two oscillators is dependent on the phase difference between them (Ermentrout and Kopell 1984; Kopell and Ermentrout 1986, 1988, 1990; Kopell et al. 1991). The results of experiments on the lamprey spinal cord have been interpreted in the context of this mathematical theory, to reveal properties of the intersegmental coordinating system. In particular, it was concluded that ascending coupling dominates descending coupling in determining the value of the intersegmental phase lag (Williams et al. 1990; Sigvardt et al. 1991). A disadvantage of the theory is that in the absence of a direct interpretation in terms of membrane potentials and synaptic transmission it is difficult to judge how well the underlying assumptions of the theory are met by the biological system, or to get a biologically intuitive grasp of what the analysis is about. To bridge some of the conceptual gap between the mathematical analysis and the behavior of real neurons, I have developed an operational variant of the theory, which can be directly applied
to an assemblage of simulated neurons. I have also derived a linear approximation to the equations governing the behavior of a chain of such oscillators, and I will show, for specific examples of coupling between neurons in neighboring segments, that the behavior of a coupled chain of such oscillators is in accordance with the predictions of the theory. 2 The Central Pattern Generator for Locomotion in the Lamprey
The lamprey spinal cord can produce rhythmic motor patterns of ventral root activity (fictive locomotion) in the absence of any patterned input (Cohen and Wallén 1980). The cyclic pattern consists of alternating bursts of activity in the left and right ventral roots of each spinal cord segment, with a rostral-caudal delay of activation along the length of the cord. Faster swimming is accomplished by a decrease in the cycle duration at all segments. The intersegmental time delay scales with the cycle duration, so that the delay between segments remains a constant fraction of the cycle, at about 0.01 (Wallén and Williams 1984). It is this intersegmental phase lag of about 1% of a cycle per segment that is the focus of the mathematical analysis. 3 The Mathematical Theory
The Kopell-Ermentrout formulation considers a chain of unit oscillators, each of which has intrinsic frequency ω. It is assumed that the oscillators are coupled to their nearest neighbors (Fig. 1A). Ascending coupling may differ from descending coupling, but both are assumed to be uniform along the cord. The effect of the coupling between two neighboring oscillators is mediated by "H-functions," which give the change in frequency caused by the coupling from one oscillator to another, as a function of the phase difference between them. It is further assumed that this effect is additive; thus the frequency Ω_k of the kth oscillator in the interior of a chain of length n is given by (Kopell and Ermentrout 1988)

Ω_k = ω_k + H_A(φ_k) + H_D(-φ_{k-1})    (3.1)

where ω_k is the intrinsic frequency of the uncoupled oscillator and φ_k is the phase of oscillator k + 1 minus the phase of oscillator k. The ascending coupling function, H_A(φ), gives the change in frequency caused by coupling from an oscillator's caudal neighbor, while the descending coupling function, H_D(-φ), gives the influence of its rostral neighbor. For both functions, the phase difference upon which the coupling depends is expressed as the phase of the sending oscillator minus the phase of the receiving oscillator. For descending coupling this is in the opposite direction to the rostral-caudal indexing of the phase lag; hence the minus sign within the functional dependence of H_D.
Figure 1: (A) Diagram of a chain of unit oscillators, numbered 1 to n in the rostral-caudal direction. Each oscillator has three types of output: ascending coupling signals (H_A) to its rostral neighbor, descending signals (H_D) to its caudal neighbor, and excitatory output to swimming motoneurons (horizontal arrows). (B) Network modeling the unit oscillator, which is identical to that proposed by Buchanan and Grillner (1987) except that the cell labeled I replaces the lateral interneuron (LIN) in the original network. This is because the lamprey spinal cord contains fewer than one LIN per segment (Rovainen 1974) and because there exist small inhibitory interneurons with powerful influence on fictive locomotion in the lamprey spinal cord (Buchanan and Grillner 1988). The output of the circuit is provided by the activity of the E cells. Parameter values (see text): e_i = 0.1 for C cells, 0.025 for I and E cells; for all cells th_i = 0, max_i = 1.0, τ_i = 0.1, r_i = 0; w_ij = 0.5 for all excitatory synapses, 1.0 for all inhibitory synapses; v_ij = 1.0 for all excitatory synapses, -1.0 for all inhibitory synapses. The frequency of the oscillatory output, ω, was increased (or decreased) by proportionately increasing (or decreasing) e_i for all cells.
It is assumed that both the ascending and descending coupling are capable of both speeding up and slowing down the receiving oscillator. The phase values at which the frequency remains unchanged, φ_A and φ_D, respectively, are defined as follows:

H_A(φ_A) = 0    (3.2)

H_D(-φ_D) = 0    (3.3)
These are referred to as zero-crossing values. Equation 3.1 also holds for the first and last oscillator in the chain if H_D(-φ_0) and H_A(φ_n) vanish from the first and last equation, respectively. Now φ_0 and φ_n do not actually exist, since there is no 0th or (n + 1)th oscillator, so they are defined as identically equal to the phase lags for which H_D and H_A, respectively, vanish, as follows:

φ_0 ≡ φ_D    (3.4)

φ_n ≡ φ_A    (3.5)
The zero-crossing phase lags φ_D and φ_A can thus be considered as "boundary conditions" on equation 3.1. Kopell and Ermentrout have shown that for unspecified, nonlinear H-functions that are monotonic with appropriate slopes (i.e., positive and negative, respectively) over a range of φ that includes their zero crossings, these equations have a unique solution. In this solution, the intersegmental phase lags φ_k lie between the boundaries φ_D and φ_A as k goes from 1 to n - 1. Furthermore, for asymmetric coupling of equally activated oscillators (all ω_k equal), the theory predicts that the intersegmental phase lag will be approximately equal to the zero-crossing phase lag of the "dominant" coupling, except in a boundary layer at the rostral or caudal end. Which coupling is dominant is determined by the relative magnitudes of each coupling function at the zero crossing of the other. Thus if |H_A(φ_D)| > |H_D(-φ_A)| (see Fig. 3A), then ascending coupling is dominant, and the boundary layer will occur at the nondominant or rostral end (Kopell and Ermentrout 1986; Williams et al. 1990). In this formulation, a positive value for φ indicates that the caudal segment leads the rostral one. This arises from the numbering of the segments from the rostral to the caudal end, and is opposite to the convention used for physiological measurements of phase lag (e.g., Wallén and Williams 1984). In the present formulation, if φ is negative then the wave of activation travels from the rostral to the caudal end, as in a forward-swimming lamprey. The H-functions in equation 3.1 can be constructed mathematically (Ermentrout and Kopell 1991), if it is known how the effects of one oscillator on another depend on the phase of each oscillator within its own cycle. Unfortunately, there is no way to determine these effects experimentally, even if the cells making up the oscillator were identified. What can be measured, both in simulations and in the lamprey spinal cord, are the ensemble frequency Ω and the phase lags between segments φ, under any particular set of conditions. Furthermore, the value of the intrinsic frequency ω can be manipulated by varying the level of activation of the cells comprising the oscillators. In this paper, a method is developed for estimating the values of the H-functions over the range of φ for which the coupling is stable, using measurements of the phase lag between coupled simulated oscillators with varying levels of activation.
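The role of the boundary conditions can be illustrated with a small shooting computation on equation 3.1. The H-functions below are invented monotonic (linear-plus-cubic) stand-ins, not coupling functions measured in this paper: guess the ensemble frequency Ω, march the φ_k forward from φ_0 = φ_D (equation 3.4), and bisect on Ω until φ_n = φ_A (equation 3.5).

```python
import numpy as np

phi_A, phi_D = -0.01, 0.015          # illustrative zero-crossing phase lags
a, b, n, omega = 1.0, 0.6, 10, 1.0   # slopes at the crossings; chain length

def H_A(phi):                        # monotone increasing, H_A(phi_A) = 0
    x = phi - phi_A
    return a * x + 5.0 * x ** 3

def H_D_of_minus(phi):               # H_D(-phi), with H_D(-phi_D) = 0
    x = phi - phi_D
    return -(b * x + 5.0 * x ** 3)

def H_A_inv(y, lo=-2.0, hi=2.0):
    for _ in range(80):              # bisection; valid because H_A is monotone
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if H_A(mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

def march(Omega):
    """Solve eq. 3.1 forward: phi_k = H_A^(-1)(Omega - omega - H_D(-phi_{k-1}))."""
    phis, prev = [], phi_D           # boundary condition, eq. 3.4
    for _ in range(n):
        prev = H_A_inv(Omega - omega - H_D_of_minus(prev))
        phis.append(prev)
    return np.array(phis)

lo, hi = omega - 0.5, omega + 0.5    # bisect on Omega to satisfy eq. 3.5
for _ in range(60):
    Omega = 0.5 * (lo + hi)
    lo, hi = (Omega, hi) if march(Omega)[-1] < phi_A else (lo, Omega)

# With these slopes ascending coupling is dominant, so the interior lags sit
# near phi_A apart from a boundary layer at the rostral end.
print(np.round(march(Omega)[:-1], 4))
```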
4 Modeling the Unit Oscillator
Grillner et al. (1988) have shown by computer simulation that if a neuronal network as shown in Figure 1B is provided with a tonic excitatory input, it can produce a periodic output that shares many of the features of the activity recorded from the lamprey spinal cord. In particular, the phase relationships between the three types of neurons are similar to those seen during fictive locomotion. More recently, Buchanan (1990, 1992) has performed simulations using a relatively simple connectionist algorithm and has shown that the network connectivity alone is sufficient to generate the activity pattern observed in these cells in the lamprey spinal cord. (For a review of modeling in the lamprey, see Sigvardt and Williams 1992.) Buchanan's connectionist model is robust and works well over a large parameter space. A similar model has been used for the unit oscillator in the present study. For each unit oscillator there are 6 "activation" variables a_i, which can be taken to represent the membrane potential trajectories of the 6 cells making up the oscillator. If a_i is above a threshold value th_i, it can be taken as a measure of the firing rate f_i of the cell, with
f_i = 0 if a_i < th_i;    f_i = a_i - th_i if a_i ≥ th_i

In each time step the change in the activation variable of cell i is given by

Δa_i = e_i(max_i - a_i) + τ_i(r_i - a_i) + Σ_j w_ij f_j (v_ij - a_i)

where e_i is the tonic excitatory drive to cell i, max_i is the maximum firing rate of the cell, r_i represents the resting potential, τ_i is a time constant for the return to r_i, w_ij represents the weight of the synapse from cell j to cell i, and v_ij represents the "reversal potential" for the synapse. These equations were derived from those provided as an option (attributed to Grossberg 1978) in the commercially available software of McClelland and Rumelhart (1988), which emulates parallel distributed processing on a small personal computer. Normalized values were used for all the parameters (see legend, Fig. 1).
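A sketch of this update rule follows. Two loudly flagged assumptions: the legend's values for τ_i and r_i are ambiguous in our copy (we take τ_i = 0.1, r_i = 0), and the intrasegmental wiring is inferred from the Figure 2 caption (E excites ipsilateral I and C; I inhibits ipsilateral C; C inhibits contralateral E, I, and C). Treat the block as an illustration of the update rule, not as Buchanan's published network.

```python
import numpy as np

# Six cells per unit oscillator, indexed (E_L, I_L, C_L, E_R, I_R, C_R).
N = 6
W = np.zeros((N, N))   # W[i, j]: weight of the synapse from cell j onto cell i
V = np.zeros((N, N))   # V[i, j]: "reversal potential" of that synapse
for L in (0, 3):
    E, I, C = L, L + 1, L + 2
    contra = [(L + 3) % N, (L + 4) % N, (L + 5) % N]   # contralateral E, I, C
    W[I, E], V[I, E] = 0.5, 1.0            # E excites ipsilateral I
    W[C, E], V[C, E] = 0.5, 1.0            # E excites ipsilateral C
    W[C, I], V[C, I] = 1.0, -1.0           # I inhibits ipsilateral C
    for tgt in contra:                     # C inhibits contralateral E, I, C
        W[tgt, C], V[tgt, C] = 1.0, -1.0

e = np.array([0.025, 0.025, 0.1] * 2)      # tonic drive; C cells get 0.1
th, mx, tau, rest = 0.0, 1.0, 0.1, 0.0     # th_i, max_i, tau_i, r_i (assumed)

def step(a):
    f = np.maximum(a - th, 0.0)            # f_i = a_i - th_i above threshold
    syn = ((W * f[None, :]) * (V - a[:, None])).sum(axis=1)
    return a + e * (mx - a) + tau * (rest - a) + syn   # the Delta-a_i update

a = np.array([0.3, 0.0, 0.0, 0.0, 0.0, 0.2])   # asymmetric start
trace = []
for _ in range(1000):
    a = step(a)
    trace.append(a[0])   # left E-cell activity; alternation indicates a rhythm
```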
5 Determining the Coupling Functions

The forms of intersegmental coupling considered have been restricted to the types of synaptic interactions seen within a segment. For example, coupling mediated by the I cell (Fig. 1B) will have only an ipsilateral C cell of a neighboring segment as a postsynaptic element, whereas for coupling mediated by the output of C, all three neurons on the contralateral side of the receiving oscillator could be postsynaptic. The coupling functions were determined by simulating two oscillators, with coupling in only one direction at a time. To define the new coupling functions, it is necessary to return briefly to the Kopell-Ermentrout formulation.
Phase lags were measured from the trajectories of homologous E-cells in the two oscillators (see Fig. 2A-C). The phase lag was taken as equal to the distance (representing time) between the two trajectories as they crossed threshold, divided by the cycle duration. In this way the function C_A(φ) = ω_2 - ω_1 was constructed. This is illustrated in Figure 2 for a particular coupling regime. To determine the zero-crossing phase lag, the coupled oscillators were given the same intrinsic frequency, by giving them equal levels of external input (Fig. 2B). The phase difference which resulted gives the value of the zero-crossing phase lag, φ_A. To find other points of the function, the receiving oscillator was given a higher or lower level of activation, corresponding to a higher or lower intrinsic frequency. The solid line in Figure 2D gives the resulting C-function for this particular coupling regime. In these simulations it is actually the inverse of the C-functions that are determined, that is, the value of C is set (as ω_2 - ω_1) and the value of φ that results from the simulation is determined. Since the corresponding H-functions are periodic (a phase lag of -0.5 must be the same as a phase lag of 0.5), there are in fact two values of φ corresponding to a given value of H, one of which occurs in a region of the function with a positive slope, and one in a region with negative slope (see Fig. 2D). Only over the positive slope region is entrainment stable, and the simulations therefore result in phase lags in this region only, which is the domain of C. The negative slope region of the curve in Figure 2D is simply drawn in by hand (dotted line) to illustrate a plausible form for the corresponding H-function. Any coupling regime can be used as either ascending or descending coupling. The C-function for one direction (e.g., descending coupling) can be obtained from the C-function for the other (ascending coupling) simply by changing the sign of φ. Thus a coupling regime that produces a negative zero-crossing phase lag (corresponding to a tailward-traveling wave) will produce a positive one if used in the opposite direction, and vice versa (see Fig. 4A).
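The measurement logic can be separated from the six-cell network with a toy stand-in: a pair of phase oscillators in which the receiver carries a hidden coupling function. Entraining the pair at several frequency differences and recording the settled lag recovers one point of C_A per run via C_A(φ) = ω_2 - ω_1, and only the positive-slope part of the hidden function is ever visited, exactly as argued above. The sinusoidal coupling and all numbers here are invented for illustration.

```python
import numpy as np

def C_hidden(phi):
    # Hidden 1-periodic coupling with its zero crossing at phi = -0.01.
    return 0.3 * np.sin(2.0 * np.pi * (phi + 0.01))

def entrained_lag(w1, w2, dt=0.01, T=1500.0):
    """Sender theta2 (frequency w2) drives receiver theta1; return the
    settled lag phi = theta2 - theta1, wrapped into (-0.5, 0.5]."""
    th1, th2 = 0.0, 0.3
    for _ in range(int(T / dt)):
        phi = th2 - th1
        th1 += dt * (w1 + C_hidden(phi))
        th2 += dt * w2
    return (th2 - th1 + 0.5) % 1.0 - 0.5

w2 = 1.0
for w1 in np.linspace(0.85, 1.15, 7):   # vary the receiver's intrinsic frequency
    phi = entrained_lag(w1, w2)
    print(f"phi = {phi:+.3f}  ->  C_A(phi) = w2 - w1 = {w2 - w1:+.3f}")
```

Plotting the printed pairs sweeps out the recoverable, positive-slope branch of the hidden function; frequency differences larger than the coupling amplitude simply fail to entrain, which bounds the domain of C.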
6 Behavior of a Chain of Oscillators

When n oscillators connected in a chain are entrained, that is, all oscillating at the same frequency Ω, the resulting phase lags represent the solution of equations 3.1. In the lamprey spinal cord, there is no evidence for a gradient of intrinsic frequencies along the cord (Cohen 1987), so all ω_k in this equation can be considered equal. To apply the theory to a simulated chain of oscillators, the H-functions in this equation can be replaced by the measured C-functions. Furthermore, in Figure 3A it is apparent that these functions are nearly linear in the region of the zero crossings, and can thus be approximated by the following expressions:

C_A(φ) = α(φ - φ_A)    (6.1)

C_D(-φ) = -β(φ - φ_D)    (6.2)
[Figure 2 appears here: panels A-C, simulated E-cell membrane potential traces for the two oscillators; panel D, the constructed C-function versus intersegmental phase lag (φ).]
Figure 2: Construction of an ascending coupling function C_A from a regime consisting of a repeat of all the intrasegmental synapses scaled by a factor of 0.05. Thus each cell of segment 2 makes the following synaptic contact with cells in segment 1: C inhibits contralateral C, I, and E; I inhibits ipsilateral C; and E excites ipsilateral I and C. The activity of the network was simulated using an algorithm by Grossberg included as an option in the software of McClelland and Rumelhart (1988). In (A)-(C) is shown the simulated membrane potential of an E cell in segment 1 (dashed line) and the ipsilateral E in segment 2 (solid line). The action potential threshold is indicated by the dotted line, and the frequency of action potentials is taken as proportional to the amount that the membrane potential exceeds threshold. In each case Ω, the frequency of the entrained pair, is equal to ω_2, so that in (A) and (C), oscillator 2 must speed up or slow down oscillator 1 in order to entrain it, and in (B) it must do neither. Phase values for these three simulations are shown as triangles in (D). Other points in (D) come from simulations over the range of ω_1 (at constant ω_2) for which 1:1 entrainment of segment 1 by segment 2 occurred. The dotted line represents the portion of the corresponding H-function over which entrainment is unstable (see text).
[Figure 3 appears here: panel A, C-functions versus intersegmental phase lag (φ); panel B, E-cell traces versus time; panel C, intersegmental phase lag versus segment number (k).]
Figure 3: Simulated activity of a chain of 10 segments compared with that predicted mathematically. (A) C-functions constructed separately from simulations of 2 segments (as in Fig. 2). Solid circles: ascending coupling (C_A) mediated by I and C of segment 2 to the appropriate cells of segment 1. Solid squares: descending coupling (C_D) mediated by E and C of segment 1 to the appropriate cells of segment 2. Synaptic strengths were 0.05 times the intrasegmental ones. This ascending coupling is "dominant" over this descending coupling, since |C_A(φ_D)| > |C_D(-φ_A)| (dotted lines). (B) Simulated membrane potential of ipsilateral E cells in segments 1, 3, 5, 7, and 9 of a simulated chain of 10 segments with ascending and descending coupling as in (A). (C) Intersegmental phase lag as a function of segment number (solid circles), average values (with standard deviations) from 4 cycles. The variability is due primarily to the discrete nature of the simulation (one cycle of activity occupies approximately 30 simulated time steps). The solid squares give φ_D and φ_A as determined in (A). The solid line is drawn from equation 6.4 using the ratio of slopes measured from (A).
where both α and β > 0, so that C_A(φ) has a positive slope, C_D(-φ) a negative one. With this linear approximation, equation 3.1 for a chain of entrained oscillators becomes

Ω = ω + α(φ_k - φ_A) - β(φ_{k-1} - φ_D),    k = 1, ..., n    (6.3)
which is a first-order linear recurrence relation for φ_k. Using the boundary conditions given in equations 3.4 and 3.5, the solution to this equation is

φ_k = φ_D + (φ_A - φ_D)(1 - f^k)/(1 - f^n)    (6.4)

where

f = β/α

From this equation it can be seen that φ_k must lie between the limits of φ_A and φ_D. If f < 1 (ascending coupling dominant), the coefficient of φ_A in equation 6.4 will be greater than the coefficient of φ_D for most values of k; the opposite will be true for f > 1. The curve drawn in Figure 3C is calculated from equation 6.4 with n = 10 and f = 0.6. This curve gives the predicted values of the intersegmental phase lags for a chain of 10 oscillators with coupling as in Figure 3A. The filled circles in Figure 3C are the results of a simulation of such a chain, and it can be seen that the intersegmental delay was approximately equal to φ_A over most of the chain, with a boundary layer at the rostral end. The simulation agrees well with that predicted from the data of Figure 3A, obtained from a single pair of oscillators, and shown by the solid line. For simulations using longer chains, the width of the boundary layer at the rostral end remained essentially unchanged, and the intersegmental phase lag for the additional segments was approximately equal to φ_A. This is also predicted by equation 6.4, even for f as great as 0.9 (for n ≥ 10). The direction of travel of the wave of activation depends on the sign of the zero-crossing phase lag of the dominant coupling. Thus in the coupling shown in Figure 3A, a rostral-to-caudal wave would result with either ascending or descending coupling dominant, since both have negative zero-crossings. For the coupling regime of Figure 4, on the other hand, the direction of wave travel depends on which coupling is dominant. For ascending coupling dominant, the wave travels rostral-to-caudal; for descending dominant, it travels caudal-to-rostral. These examples illustrate the fallacy in the intuitive argument that the direction of wave travel should be in the direction of the dominant coupling. For symmetric coupling, f = 1, and the solution to equation 6.3 is given by

φ_k = φ_D + (φ_A - φ_D) k/n    (6.5)
[Figure 4 appears here: panel A, coupling functions versus intersegmental phase lag (φ); panel B, intersegmental phase lag versus segment number (k) for several coupling-strength ratios.]
Figure 4: The effect of dominance on intersegmental phase lag. (A) Ascending coupling (C_A) as in Figure 2; descending coupling (C_D) the same but with synaptic strengths reduced from 0.05 to 0.02 times the intrasegmental strengths, making ascending coupling dominant. (B) Intersegmental phase lags from simulated chains with different coupling strengths. Relative ascending/descending synaptic strengths, in upper to lower curve, in order, were 0.01/0.05, 0.05/0.05, 0.05/0.02, 0.05/0.01. Note that for symmetric coupling (0.05/0.05), the phase lag varies along the entire length of the chain, as predicted by equation 6.5, and changes sign in the center. This represents a traveling wave beginning in the center and traveling in opposite directions toward the head and tail ends. For asymmetric coupling the traveling waves were of approximately uniform speed except in the boundary layer, the width of which decreased with increasing dominance. For a chain of greater length, the phase lags for the asymmetric regimes remained virtually unchanged over the first 10 segments (lower 2 curves) or last 10 segments (upper curve), and remained approximately equal to the zero-crossing of the dominant coupling for the remaining segments.
which represents a linear change in phase lag along the chain. This prediction was also confirmed, as seen in the triangles of Figure 4B. One of the assumptions implicit in the application of the mathematical analysis is that the coupling signals received by an oscillator do not alter those transmitted by that oscillator. This assumption is likely to be met if the coupling does not significantly distort the relative timing and magnitude of the activity of the cells within a unit oscillator. In this study the strength of the coupling synapses was low compared to intrasegmental synaptic strengths (0.01 to 0.05), and it was found that the intersegmental phase lag was close but not precisely equal to the zero-crossing of the dominant coupling (as in Figure 4B). Thus this study has demonstrated that intersegmental coupling between the cells comprising the unit oscillators can give rise to well-behaved C-functions and behavior that obeys the predictions of the mathematical analysis. In particular, with asymmetric coupling of equally activated oscillators, the intersegmental phase lag is uniform over the length of the chain, except for a small boundary layer at one end, and is approximately equal to the zero-crossing phase lag of the dominant coupling.
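For reference, the reconstructed equations 6.4 and 6.5 are easy to evaluate directly; the sketch below prints the predicted lag profiles for a dominant-ascending chain and a symmetric chain. The zero-crossing values are illustrative, not fitted to the simulations.

```python
import numpy as np

def predicted_lags(phi_A, phi_D, f, n):
    """Intersegmental phase lags phi_1..phi_{n-1} from eqs. 6.4 and 6.5."""
    k = np.arange(1, n)
    if np.isclose(f, 1.0):
        return phi_D + (phi_A - phi_D) * k / n                    # eq. 6.5
    return phi_D + (phi_A - phi_D) * (1 - f ** k) / (1 - f ** n)  # eq. 6.4

# f < 1 (ascending dominant): plateau near phi_A with a rostral boundary layer.
print(np.round(predicted_lags(-0.01, 0.015, 0.6, 10), 4))
# f = 1 (symmetric): linear gradient that changes sign along the chain.
print(np.round(predicted_lags(-0.01, 0.015, 1.0, 10), 4))
```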
Acknowledgments

I am grateful to Jim Buchanan for introducing me to the McClelland-Rumelhart software and showing me his unpublished results for the lamprey CPG, and to Graham Bowtell for helpful suggestions on equations 6.4 and 6.5. This work was supported by the SERC.

References

Buchanan, J. T. 1990. Simulations of lamprey locomotion: Emergent network properties and phase coupling. Eur. J. Neurosci., Suppl. 3, 184 (abstract).
Buchanan, J. T. 1992. Neural network simulations of coupled locomotor oscillators in the lamprey spinal cord. Biol. Cybernet. 66, 367-374.
Buchanan, J. T., and Grillner, S. 1987. Newly identified 'glutamate interneurons' and their role in locomotion in the lamprey spinal cord. Science 236, 312-314.
Buchanan, J. T., and Grillner, S. 1988. A new class of small inhibitory interneurones in the lamprey spinal cord. Brain Res. 438, 404-407.
Cohen, A. H. 1987. Effects of oscillator frequency on phase-locking in the lamprey central pattern generator. J. Neurosci. Methods 21, 113-125.
Cohen, A. H., and Wallén, P. 1980. The neuronal correlate of locomotion in fish. 'Fictive swimming' induced in an in vitro preparation of the lamprey. Exp. Brain Res. 41, 11-18.
Ermentrout, G. B., and Kopell, N. 1984. Frequency plateaus in a chain of weakly coupled oscillators. SIAM J. Math. Anal. 15, 215-237.
Ermentrout, G. B., and Kopell, N. 1991. Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Math. Biol. 29, 195-217.
Grillner, S., Buchanan, J. T., and Lansner, A. 1988. Simulation of the segmental burst generating network for locomotion in lamprey. Neurosci. Lett. 89, 31-35.
Grossberg, S. 1978. A theory of visual coding, memory, and development. In Formal Theories of Visual Perception, E. L. J. Leeuwenberg and H. F. J. M. Buffart, eds. Wiley, New York.
Kopell, N., and Ermentrout, G. B. 1986. Symmetry and phase-locking in chains of weakly coupled oscillators. Comm. Pure Appl. Math. 39, 623-660.
Kopell, N., and Ermentrout, G. B. 1988. Coupled oscillators and the design of central pattern generators. Math. Biosci. 90, 87-109.
Kopell, N., and Ermentrout, G. B. 1990. Phase transitions and other phenomena in chains of coupled oscillators. SIAM J. Appl. Math. 50, 1014-1052.
Kopell, N., Ermentrout, G. B., and Williams, T. L. 1991. On chains of oscillators forced at one end. SIAM J. Appl. Math. 51, 10-31.
McClelland, J. L., and Rumelhart, D. E. 1988. Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises. MIT Press, Cambridge.
Rovainen, C. M. 1974. Synaptic interactions of identified nerve cells in the spinal cord of the sea lamprey. J. Comp. Neurol. 154, 189-206.
Sigvardt, K. A., Kopell, N., Ermentrout, G. B., and Remler, M. P. 1991. Effects of local oscillator frequency on intersegmental coordination in the lamprey locomotor CPG: Theory and experiment. Soc. Neurosci. Abstr. 17, 122.
Sigvardt, K. A., and Williams, T. L. 1992. Models of central pattern generators as oscillators: The lamprey locomotor CPG. Semin. Neurosci. 4, 37-46.
Wallén, P., and Williams, T. L. 1984. Fictive locomotion in the lamprey spinal cord in vitro compared with swimming in the intact and spinal animal. J. Physiol. 347, 225-239.
Williams, T. L., Sigvardt, K. A., Kopell, N., Ermentrout, G. B., and Remler, M. P. 1990. Forcing of coupled nonlinear oscillators: Studies of intersegmental coordination in the lamprey locomotor central pattern generator. J. Neurophysiol. 64, 862-871.
Received 13 August 1991; accepted 11 November 1991.
Communicated by Kenneth Miller
Understanding Retinal Color Coding from First Principles Joseph J. Atick Zhaoping Li A. Norman Redlich School of Natural Sciences, Institute for Advanced Study, Princeton, NJ 08540 USA A previously proposed theory of visual processing, based on redundancy reduction, is used to derive the retinal transfer function including color. The predicted kernels show the nontrivial mixing of space-time with color coding observed in experiments. The differences in color coding between species are found to be due to differences among the chromatic autocorrelators for natural scenes in different environments. 1 Introduction
The retinas of many species code color signals in a nontrivial way that is strongly coupled to their coding of spatial and temporal information. For example, in the primate retina many color coding ganglion cells are excited by "green" light¹ falling on the centers of their receptive fields on the retina, but their response is inhibited by "red" light falling in a concentric annulus about the green center, called the surround region of their receptive field. There are also red-center, green-surround cells (Derrington et al. 1984), as well as rarer types involving blue cones. Such arrangements, which can be termed "single-opponency," are not the only types found in nature. For example, freshwater fish such as goldfish and carp have a different type of coding called "double-opponency" (Daw 1968). Their ganglion cells are color opponent (they calculate the difference between the outputs of different cone types at each spatial location) and they are spatially opponent (with a center-surround receptive field), but their color and spatial encoding are mostly decoupled. One of the challenges for a theory of retinal processing is to account for the difference between this double-opponent goldfish code and the single-opponent primate code, as well as the range of intermediate response types observed in other species.

¹In this paper, we use "green" and "red" to denote light with spectral frequencies exciting primarily the cones with medium and long spectral sensitivities, respectively.
Neural Computation 4, 559-572 (1992)
© 1992 Massachusetts Institute of Technology
In this paper, we demonstrate that the computable theory of early visual processing reported by Atick and Redlich (1990, 1992; henceforth refs. I and II) can explain this variety of retinal processing types. As explained at length there, the theory hypothesizes that the purpose of retinal coding is to reduce both redundancy and noise in the visual image. The idea of redundancy reduction as an efficiency goal in the sensory system was first proposed by Attneave (1954) and Barlow (1961). In the retina, redundancy in space, time, and color comes from the fact that the pixel-by-pixel representation of natural scenes, which is the representation formed by the photoreceptors, contains a high degree of correlations among pixels. Therefore, many pixels redundantly represent the same information. With color, there is an additional source of correlation between the photoreceptor outputs, coming from the overlapping spectral domains of the three cone types. To improve efficiency, the retina can recode the photoreceptor signal to eliminate correlations in space, time, and color. In refs. I and II, it was assumed that the retina, being only the first stage in the visual pathway, can eliminate only the simplest form of redundancy, which comes from pixel-pixel correlations: second-order correlation. It makes sense for the retina to eliminate second-order correlation, since it is the source of the largest fraction of redundancy in the image, and it can be eliminated relatively easily through a linear transformation that decorrelates the input (photoreceptor) signal. As shown in I and II, decorrelation together with noise reduction does give a retinal transfer function that agrees with available data from contrast sensitivity experiments. Here we take that analysis one step further and solve for the system that decorrelates at the two-point level both in space and color. What we find is that the differences seen in the color coding of primates and fish can be attributed to plausible differences in the color correlation matrix for the two species. More precisely, we note that the degree of overlap between the R and G cones in primates is greater than the corresponding overlap in fish (the R and G spectral sensitivity peaks are separated by only 32 nm for the primates but by 90 nm for the fish). This difference in photoreceptor sampling is attributed to differences between the primate visual environment and the environment under water (Lythgoe 1979). What we show in this paper is that this sampling difference has very pronounced effects on the subsequent neural processing needed to achieve decorrelation. In fact, it will enable us to account for single- vs. double-opponency coding. In passing, we should mention that we limit our analysis to the two-cone (R and G) system, since in the primate retina these photoreceptors occur with equal density and are more abundant than the blue cones. In fact, the blue cones constitute only 15% of the total cone population in the entire retina, while in the fovea they are virtually nonexistent. We also confine ourselves to color coding by linear cells, which implies cells in the primate parvocellular pathway.
It is important to point out, however, that the mixing between space, time, and color that we derive here does not come only from decorrelation. In fact, we use here a correlation matrix which itself does not mix space-time with color, though such mixing in the correlation matrix can easily be included in our analysis, and it only accentuates the effect found here (see Atick et al. 1990 for the more general analysis). It is actually noise filtering, together with redundancy reduction, which produces the nontrivial interactions. Noise filtering is a prerequisite for achieving single opponency, and it also explains the observed differences between psychophysical contrast sensitivity measurements in the luminance and chromatic channels. We should point out that the earliest attempt to explain color opponency using decorrelation was made by Buchsbaum and Gottschalk (1983), also inspired by Barlow (1961). However, their work did not include the spatiotemporal dimensions, nor did it include noise, so it does not account for the observed nontrivial coupling of space-time and color.
2 Decorrelation and Color Coding
As in refs. I and II, we make the hypothesis that the purpose of retinal processing is to produce a more efficient representation of the incoming information by reducing redundancy. With the assumption of linear processing, the retina can eliminate only the simplest form of redundancy, namely second-order correlations. However, second-order decorrelation cannot be the only goal of retinal processing since, in the presence of noise, as was argued in II, decorrelation alone would be a dangerous computational strategy. This is due to the fact that after decorrelation both useful signal and noise are coded in a way that makes their distinction no longer possible (they both have the properties of random noise). Thus, for decorrelation, or more generally redundancy reduction, to be a viable computational strategy, there must be a guarantee that no significant input noise be allowed to pass. The way we handle this noise here is similar to the approach in II for the purely spatial domain: we first lowpass filter to diminish noise and then decorrelate as if no noise existed. Figure 1 is a schematic of the processing stages we assume take place in the retina. We should emphasize that this is meant to be an effective computational model and is not necessarily a model of anatomically distinct stages in the retina. As shown in the figure, the intensity signal L(x, t, λ), depending on space, time, and spectral wavelength λ, is first transduced by the photoreceptors to produce a discrete set of photoreceptor outputs,

P^a(x, t) = ∫ dλ C^a(λ) L(x, t, λ)    (2.1)
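Equation 2.1 is easy to evaluate numerically. The sketch below is a minimal illustration of ours, not part of the original model: it approximates the two spectral sensitivities C^a(λ) by Gaussians with hypothetical peak wavelengths and bandwidths, and computes the cone outputs as discretized integrals over a sample spectrum.

import numpy as np

# Wavelength grid (nm). The Gaussian sensitivities below are illustrative
# placeholders, not the measured cone fundamentals.
lam = np.linspace(400.0, 700.0, 301)

def cone_sensitivity(peak_nm, width_nm=40.0):
    # Hypothetical spectral sensitivity C^a(lambda), normalized to unit area.
    c = np.exp(-0.5 * ((lam - peak_nm) / width_nm) ** 2)
    return c / np.trapz(c, lam)

C = {"R": cone_sensitivity(565.0),   # long-wavelength ("red") cone
     "G": cone_sensitivity(535.0)}   # medium-wavelength ("green") cone

# A sample spectrum L(x, t, lambda) at one point and instant: a flat
# background plus a long-wavelength bump.
L = 1.0 + 0.5 * np.exp(-0.5 * ((lam - 620.0) / 30.0) ** 2)

# Equation 2.1: P^a = integral of C^a(lambda) L(lambda) d(lambda).
P = {a: np.trapz(C[a] * L, lam) for a in C}
print(P)   # R exceeds G because the bump lies nearer the R peak

The overlap parameter r introduced below corresponds roughly to the degree to which two such sensitivity curves sample the same spectral region: the closer the peaks, the nearer r is to 1.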
562
J. J. Atick, Z. Li, and A. N. Redlich
Figure 1: Schematic of the signal-processing stages for the model of the retina used here. At the first stage, images are sampled by the photoreceptors to produce the chromatic signals P^a. These are subsequently lowpass filtered by M^{ab} to eliminate noise, and then decorrelated by K^{ab} to reduce redundancy.

The functions C^a(λ) are the spectral sensitivities of the two (more generally three) photoreceptor types, a = 1, 2 for R, G, respectively. Following transduction, the photoreceptor signals are lowpass filtered by a function M^{ab}(x, t; x', t') to reduce noise. Having filtered out the noise, the final stage in Figure 1 is to reduce the redundancy using the transfer function K^{ab}(x, t; x', t'), which produces decorrelated retinal outputs. Thus the output O is related to the input P through

O = K · (M · (P + n) + n₀)    (2.2)
where n^a(x, t) is input noise, including transduction and quantum noise, while n₀^a(x, t) is noise (e.g., synaptic) that is added following the filter M. Such post-filter noise, though it may be small, must be included because it is very significant from an information-theoretic point of view: it sets the scale (accuracy) for measuring the signal at the output of the filter M. We have introduced boldface to denote matrices in the 2 × 2 color space; also, in equation 2.2 each · denotes a space-time convolution. To derive both filters M and K, we require some knowledge of the statistical properties of the luminance signals L(x, t, λ): the statistical properties of natural scenes. For our linear analysis here, we only need the chromatic-spatiotemporal autocorrelator, which is a matrix of the form

R^{ab}(x, t; x', t') = ⟨P^a(x, t) P^b(x', t')⟩ = ∫ dλ dλ' C^a(λ) C^b(λ') ⟨L(x, t, λ) L(x', t', λ')⟩

where ⟨·⟩ denotes ensemble average. Unfortunately, not much is known experimentally about the entries of the matrix R^{ab}(x, t; x', t'). Thus, to gain insight into the color coding problem we are forced to make some assumptions. First, we assume translation invariance in space and time: R is then a function only of x - x' and t - t', so it can be Fourier transformed to R^{ab}(f, ω), where f and ω are spatial and temporal frequency, respectively.
Second, we assume R^{ab}(f, ω) can be factorized into a pure spatiotemporal correlator times a 2 × 2 matrix describing the degree of overlap between the R and G systems. This second assumption is not absolutely necessary, since it is possible to perform the analysis entirely for the most general form of R^{ab}(f, ω) (see Atick et al. 1990). However, this assumption does make it much simpler to analyze and explain our theoretical results. We also examine color coding only under conditions of slow temporal stimulation, or near zero temporal frequency. In that case, we do have available Field's (1987) experimental measurement of the spatiotemporal correlator: R(f, 0) = I₀²/|f|², with I₀ the mean background luminance of the input signal. Using this R(f), and making the further simplification that the mean squared R and G photoreceptor outputs are roughly equal, we arrive at

R^{ab}(f) = (I₀²/|f|²) ( 1  r ; r  1 )    (2.3)
where r < 1 is a parameter describing the amount of overlap of R and G. We should emphasize that we do not advocate this simple form of R^{ab} as necessarily the one found in nature. More complex R^{ab} can similarly be dealt with, but they produce qualitatively similar solutions. The next step is to use this autocorrelator to derive a noise filter M^{ab}(f) (from now on we drop the explicit ω dependence). In ref. II, the principle used to derive M(f), without color, was to maximize the mutual information between the output of the filter and the ideal input signal [the signal L(f, ω) without noise], while constraining the total entropy of the output signal. The resulting lowpass filter cannot be complete, however, since it does not include the effects of the optics, but these can be incorporated by multiplying by the optical modulation transfer function (MTF). As discussed in detail in ref. II, in the absence of color (the one-channel problem), this gives

M(f) = [R(f)/(R(f) + N²)] e^{-(|f|/f_c)^α}    (2.4)

with N² the input noise power. Here, the exponential term approximates the optical MTF, which has been independently measured (Campbell and Gubisch 1966); we use typical values for the parameters α and f_c. Although, as shown in ref. II, this filter matches the spatial experimental data well, other filters can also give good results. For example, one may use a maximum log-likelihood principle, equivalent in our case to using mean squared estimation. The really important property all such filters must have, however, is that their shape must depend on the signal to noise (S/N) at their input. To see how color correlations (the two-channel problem) affect the spatiotemporal lowpass filtering, it is helpful to rotate in color space to the basis where the color matrix is diagonal.
For the simple color matrix in equation 2.3, this is a 45° rotation by

U = (1/√2) ( 1  1 ; -1  1 )

to the luminance, G + R, and chromatic, G - R, channels [in vector notation, the red and green channels are denoted by R = (1, 0) and G = (0, 1)]. In this G ± R basis, the total correlation matrix, equation 2.3, plus the contribution due to noise, is

( (1 + r) I₀²/|f|² + N²   0 ; 0   (1 - r) I₀²/|f|² + N² )    (2.5)

where the noise, ⟨n^a n^b⟩ = δ^{ab} N², is assumed equal in both the R and G channels, for simplicity. In the G ± R basis, the two color channels are decoupled. Thus, the corresponding spatiotemporal filters M±(f) are found by applying our single-channel results, equation 2.4, independently to each channel. The R(f) appropriate to each G ± R channel is, from equation 2.5,

R±(f) = (1 ± r) I₀²/|f|²    (2.6)

Notice that the two channels differ only in their effective S/N ratios:

(S/N)± = √(1 ± r) (I₀/N)
which depend multiplicatively on the color eigenvalues 1 ± r. In the luminance channel, G + R, the signal to noise is increased above that in either the R or G channel alone, due to the summation over the R and G signals. The filter M₊(f), therefore, passes relatively higher spatial and temporal frequencies, increasing spatiotemporal resolution, than without the R plus G summation. On the other hand, the chromatic channel, G - R, has lower S/N, proportional to 1 - r, so its spatiotemporal filter M₋(f) cuts out higher spatial and temporal frequencies, thus sacrificing spatiotemporal resolution in favor of color discriminability. The complete filter in the original basis is finally obtained by rotating back from the diagonal basis by 45°:

M(f) = (1/2) ( 1  -1 ; 1  1 ) ( M₊(f)  0 ; 0  M₋(f) ) ( 1  1 ; -1  1 )

[Again, M±(f) is given by equation 2.4 with R(f) → R±(f) of equation 2.6.]
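The color-space bookkeeping in the last equation is just a similarity transform of a diagonal filter matrix. The following sketch is our own illustration, with M±(f) stand-ins built from the form of equation 2.4 and the parameter values quoted later for Figure 2; it verifies that rotating into the G ± R basis, filtering each channel, and rotating back yields a symmetric 2 × 2 filter whose off-diagonal entries mix R and G.

import numpy as np

I0_over_N = 5.0          # signal-to-noise scale, as quoted for Figure 2
r = 0.85                 # R-G overlap parameter (primate regime)
alpha, fc = 1.4, 22.0    # optical MTF parameters

def M_channel(f, eigenvalue):
    # Single-channel lowpass filter of equation 2.4 applied to
    # R±(f) = (1 ± r) I0^2/f^2; `eigenvalue` is 1+r or 1-r.
    R = eigenvalue * (I0_over_N ** 2) / f ** 2   # R±(f) in units of N^2
    return R / (R + 1.0) * np.exp(-(f / fc) ** alpha)

U = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2.0)   # (R,G) -> (G+R, G-R)

f = 5.0   # one spatial frequency, c/deg
Mdiag = np.diag([M_channel(f, 1 + r), M_channel(f, 1 - r)])
M_rg = U.T @ Mdiag @ U    # complete filter in the original R,G basis
print(M_rg)               # symmetric; off-diagonal terms mix R and G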
After filtering noise, the next step is to reduce redundancy using the transfer function K^{ab}(f, ω).²

²By redundancy here, as in ref. II, we mean the difference between the average information H, calculated using the joint probabilities for the neural outputs, and the sum of the "bit" entropies Σᵢ Hᵢ, calculated treating the ith neuron completely independently. More precisely, H = -Σ_{i,k,...} P_{i,k,...} log(P_{i,k,...}), using the complete joint probabilities P_{i,k,...} = P(Oᵢ, Oₖ, ...) for the neural (e.g., photoreceptor) outputs Oᵢ with space-time-color index i, while Hᵢ = -Σ P(Oᵢ) log[P(Oᵢ)]. The difference between H and Σᵢ Hᵢ measures the amount of statistical dependence of the neural signals on each other: the more dependent, the greater the redundancy, since then more bits effectively carry the same information. Reducing redundancy amounts to finding a transformation on the signals Oᵢ so that after the transformation the ratio Σᵢ Hᵢ / H is lowered.
At the photoreceptor level, most of the redundancy is due to second-order statistics: autocorrelation. If we ignore noise for the moment, then this redundancy can be eliminated, as shown in ref. II, by a linear transformation K^{ab}(x - x') that diagonalizes the correlation matrix R^{ab}(x - x'), so that at second order the signals are statistically independent: K · R · Kᵀ = D, with D a diagonal matrix both in color and space-time. This does not, however, uniquely specify K, since the matrix D is still an arbitrary diagonal matrix. In the spatiotemporal case, we found a unique solution by requiring a translationally invariant, local set of retinal filters: the approximation where all retinal ganglion cells (in some local neighborhood, at least) have the same receptive fields, except translated on the retina, and these fields sum from only a nearby set of photoreceptor inputs. These assumptions force D to be proportional to the unit matrix, D = p·1, with proportionality constant p. This gives, in frequency space, the whitening filter

K(f) = √(p/R(f))

In generalizing this to include color, we note that when D is proportional to the unit matrix, the mean squared outputs [(K R Kᵀ)^{aa} for outputs O^a] of all ganglion cells are equal. This equalization provides efficient use of optic nerve cables (ganglion cell axons) if the set of cables for the cells in a local neighborhood has similar information-carrying capacity. We therefore continue to take D proportional to the identity matrix in the combined space-time-color system. Taking D proportional to the identity, however, leaves a symmetry, since one can still rotate by a 2 × 2 orthogonal matrix U_θ^{ab}, that is, K(f) → U_θ K(f), which leaves D proportional to the identity (U_θ^{ab} is a constant matrix depending on only one number, the rotation angle; it satisfies U_θ U_θᵀ = 1). This freedom to rotate by U_θ will be eliminated later by looking at how much information (basically S/N) is carried by each ganglion cell output. We shall insist that no optic nerves are wasted carrying signals with very low S/N. Returning to the situation with noise, the correlation matrix to be diagonalized here by K^{ab}(f) is the one for the signal after filtering by M (see Fig. 1). To derive K^{ab}(f), we go back to the G ± R basis where M^{ab}(f) is diagonal in color space. Then we can again apply the single-channel analysis from Atick and Redlich (1992) to each channel separately. This gives two functions K±(f) that are chosen to separately whiten the G ± R channels, respectively.
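The defining property K · R · Kᵀ = p·1 is easy to check numerically; the fragment below is a minimal sketch of ours with an assumed 1/|f|² input spectrum, confirming that K(f) = √(p/R(f)) gives a flat (frequency-independent) output power.

import numpy as np

p = 1.0                        # target output power
f = np.linspace(0.5, 50.0, 100)
R = 1.0 / f ** 2               # assumed input power spectrum, R(f) ~ 1/|f|^2
K = np.sqrt(p / R)             # whitening filter
out_power = K ** 2 * R         # output power after whitening
assert np.allclose(out_power, p)   # flat: every frequency carries equal power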
Since the complete frequency-space correlators in the two channels after filtering by M±(f) are M±²(f)(R±(f) + N²) + N₀², the K±(f) are therefore

K±(f) = [ p / ( M±²(f)(R±(f) + N²) + N₀² ) ]^{1/2}    (2.7)

where N₀² is the power of the noise that is added following the filter M^{ab}(f) (see equation 2.2). Equation 2.7 generalizes the whitening filter K(f) = √(p/R(f)) to the case with noise. Now, putting equation 2.7 together with equations 2.4 and 2.6, we obtain the complete retinal transfer function, the one measured experimentally:

K^{ab}(f) = U_θ ( K₊(f)M₊(f)  0 ; 0  K₋(f)M₋(f) ) (1/√2) ( 1  1 ; -1  1 )    (2.8)
The right-most matrix transforms the G, R inputs into the G ± R basis. These two channels are then separately filtered by K±M±. Finally, the rotation U_θ, to be specified shortly, determines the mix of these two channels carried by individual retinal ganglion cells.
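Equations 2.4, 2.6, 2.7, and 2.8 chain together into a computable transfer function. The sketch below is our reading of that chain, using the illustrative parameter values quoted for Figure 2; U_θ is left as an explicit rotation angle so that θ = 0 reproduces the pure G ± R channels.

import numpy as np

I0_over_N, N0, alpha, fc, p = 5.0, 1.0, 1.4, 22.0, 1.0

def channel_filter(f, r, sign):
    # K±(f) M±(f) for one G ± R channel (sign = +1 or -1), with
    # noise powers expressed in units of N^2.
    Rpm = (1 + sign * r) * (I0_over_N ** 2) / f ** 2       # eq. 2.6
    M = Rpm / (Rpm + 1.0) * np.exp(-(f / fc) ** alpha)     # eq. 2.4
    K = np.sqrt(p / (M ** 2 * (Rpm + 1.0) + N0 ** 2))      # eq. 2.7
    return K * M

def retinal_transfer(f, r, theta=0.0):
    # Complete K^{ab}(f) of equation 2.8 acting on (R, G) inputs.
    D = np.diag([channel_filter(f, r, +1), channel_filter(f, r, -1)])
    U = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2.0)  # (R,G) -> (G±R)
    c, s = np.cos(theta), np.sin(theta)
    Utheta = np.array([[c, -s], [s, c]])                    # output multiplexing
    return Utheta @ D @ U

print(retinal_transfer(5.0, r=0.85, theta=np.pi / 4))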
3 Properties of Solutions

We now use our theoretical solution (equation 2.8) to explain the observed color processing. Specifically, we now show how such diverse processing types as those found in goldfish and primates are both given by equation 2.8, but for different values of the parameter r in the color correlation matrix. For the case of goldfish, where, as argued in the introduction, one expects only small overlap between R and G (r is small), the two channels in the diagonal basis have eigenvalues 1 ± r, which are comparable: (1 - r)/(1 + r) ≈ 1. This means that both channels will on average be carrying roughly the same amount of information and will handle signals of comparable S/N. Thus the filters K₊(f)M₊(f) and K₋(f)M₋(f) are very similar. In fact they are both bandpass filters, as shown in Figure 2A for a typical set of parameters. Since these channels are already nearly equalized in S/N, there is no need to rotate them using U_θ, so that matrix can be set to unity. Therefore, the complete solution (equation 2.8), when acting on the input vectors R, G, gives two output channels corresponding to two ganglion cell types:
Z₁ = (G + R) K₊M₊
Z₂ = (G - R) K₋M₋    (3.1)
If we Fourier transform these solutions to get their profiles in space, we arrive at the kernels K^{ab}(x - x') shown in Figure 3 for a typical set of parameters.
The top row is one cell type acting on the R and G signals, and the bottom row is another cell type. These have the properties of double-opponency cells.

Figure 2: (A,B) The luminance and chromatic channels for the goldfish (A) and for primates (B). In both panels, the curve that is more bandpass-like is the luminance G + R channel, while the other is the G - R channel. Parameters used are I₀/N = 5.0, α = 1.4, f_c = 22.0 c/deg, and N₀ = 1.0 for both A and B. The only difference between A and B is r: for A, r = 0.2, while for B, r = 0.85.

Moving to primates, there is one crucial difference, which is the expectation that r is closer to 1, since the overlap of the spectral sensitivity curves of the red and green cones is much greater: the ratio of eigenvalues (1 - r)/(1 + r) ≪ 1. Since the eigenvalues modify the S/N, this means that the G - R channel has a low S/N while the G + R channel has a much higher S/N. Therefore, K₋(f)M₋(f) is a lowpass filter while K₊(f)M₊(f) is bandpass, as shown in Figure 2B.
These two channels can be identified with the chromatic and luminance channels measured in psychophysical experiments, respectively. The curves shown in Figure 2B do qualitatively match the results of psychophysical contrast sensitivity experiments (Mullen 1985): namely, the lowpass and bandpass properties of the chromatic and luminance curves, respectively. So, according to our theory, these differences in spatial processing come from the hierarchy between the color eigenvalues, which leads to different spatiotemporal S/N in the two channels.

Figure 3: The retinal kernel K^{ab} in the R and G basis predicted by the theory for r = 0.2 (goldfish regime) and for the same parameters used in Figure 2. The top and bottom rows correspond to two different types of retinal ganglion cells predicted by the theory. These cells can be termed double opponent, and they are similar to many goldfish ganglion cells.
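Kernels of the kind shown in Figure 3 follow from the frequency-domain solution by an inverse Fourier transform. The one-dimensional sketch below is our own illustration, with the same hypothetical filter forms as the earlier fragments and the goldfish value r = 0.2; it recovers bandpass, center-surround profiles for both channels.

import numpy as np

I0_over_N, N0, alpha, fc, p, r = 5.0, 1.0, 1.4, 22.0, 1.0, 0.2  # goldfish regime

def KM(f, sign):
    # K±(f) M±(f) for one G ± R channel, as in the previous sketches.
    f = np.where(f == 0.0, 1e-6, f)                    # avoid division by zero
    R = (1 + sign * r) * (I0_over_N ** 2) / f ** 2
    M = R / (R + 1.0) * np.exp(-(np.abs(f) / fc) ** alpha)
    return np.sqrt(p / (M ** 2 * (R + 1.0) + N0 ** 2)) * M

n, df = 1024, 0.1
f = np.fft.fftfreq(n, d=1.0 / (n * df))                    # 1D frequency grid
kern_plus = np.fft.fftshift(np.fft.ifft(KM(f, +1))).real   # (G+R) channel kernel
kern_minus = np.fft.fftshift(np.fft.ifft(KM(f, -1))).real  # (G-R) channel kernel

# Cell Z2 of equation 3.1 weights G by +kern_minus and R by -kern_minus:
# color opponent with a center-surround spatial profile, i.e., double opponency.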
Although there is psychophysical evidence that color information in primates under normal conditions is physically organized into luminance and chromatic channels in the cortex (Mullen 1985), this is not how the primate retina transmits the information down the optic nerve (Derrington et al. 1984). One reason that might explain why the primate retina chooses not to use the G ± R basis is that the representation of information in chromatic and luminance channels has one undesirable consequence: if we compute the signal-to-noise ratio as a function of frequency in the chromatic channel, given by (S/N)₋ = K₋²M₋²R₋ / [K₋²(M₋²N² + N₀²)], and compare it with the corresponding ratio in the luminance channel, we find that the ratio (S/N)₋/(S/N)₊ ≪ 1 because (1 - r)/(1 + r) ≪ 1. So for primates, transmitting the information in the luminance and chromatic basis would result in one channel with very low S/N, or equivalently one channel that does not carry much information. Transmitting information at low S/N down the optic nerve could be dangerous, especially since the optic nerve introduces intrinsic noise of its own; it also may be wasteful of optic nerve hardware. What we propose here is to use the remaining symmetry of multiplication by the rotation matrix U_θ to "multiplex" the two channels so they carry the same amount of information, that is, such that they have the same S/N at each frequency. We should point out that the same could have been done for the goldfish, but there the two channels of equation 3.1, (G + R)K₊(f)M₊(f) and (G - R)K₋(f)M₋(f), already have approximately equal S/N, so the degree of multiplexing is very small or ignorable. In the case of primates, where the hierarchy in S/N between the two channels is large, the mixing of the two channels will be significant. In fact, the angle of rotation needed is approximately 45°. This leads finally to the following solutions for the two optimally decorrelated channels with equalized S/N ratios:
Z₁ = (G + R) K₊M₊ - (G - R) K₋M₋ = R (K₊M₊ + K₋M₋) + G (K₊M₊ - K₋M₋)
Z₂ = (G + R) K₊M₊ + (G - R) K₋M₋ = R (K₊M₊ - K₋M₋) + G (K₊M₊ + K₋M₋)    (3.2)
Since for primates K₊(f)M₊(f) and K₋(f)M₋(f) are very different, the end result is a dramatic mixing of space and color. For example, cell no. 1 at low frequency has K₋(f)M₋(f) > K₊(f)M₊(f), so it performs an opponent R - G processing. As the frequency is increased, however, K₋(f)M₋(f) becomes smaller than K₊(f)M₊(f), and the cell makes a transition to a smoothing G + R type of processing (Derrington et al. 1984). In Figure 4, we show the filters in frequency space, in the R and G basis. These filters are in principle directly measurable in contrast sensitivity experiments. We view the zero crossing at some frequency as a generic prediction of this theory.
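The zero-crossing prediction can be made concrete: by equation 3.2, the G coefficient of cell Z₁ is K₊M₊ - K₋M₋, which changes sign at the frequency where the two channel filters cross. A small sketch, again with the illustrative filter forms and the Figure 2 primate parameters, locates that frequency.

import numpy as np

I0_over_N, N0, alpha, fc, p, r = 5.0, 1.0, 1.4, 22.0, 1.0, 0.85  # primate regime

def KM(f, sign):
    R = (1 + sign * r) * (I0_over_N ** 2) / f ** 2
    M = R / (R + 1.0) * np.exp(-(f / fc) ** alpha)
    return np.sqrt(p / (M ** 2 * (R + 1.0) + N0 ** 2)) * M

f = np.linspace(0.1, 30.0, 3000)        # spatial frequency, c/deg
g_coeff = KM(f, +1) - KM(f, -1)         # G coefficient of cell Z1 (eq. 3.2)
crossings = f[np.where(np.diff(np.sign(g_coeff)))[0]]
print(crossings)   # frequency where Z1 switches from R-G opponency to G+R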
Figure 4: The predicted retinal filter K^{ab}(f) in the R and G basis for the parameters in Figure 2, with r = 0.85 (primate regime). The solid (dashed) lines represent excitatory (inhibitory) responses. Notice that both cells Z₁ and Z₂ make a transition at some frequency from opponent color coding (G - R or R - G) to nonopponent (G + R).

In Figure 5 (dashed line), we show how the solutions look for a typical set of parameters after Fourier transforming back to space. We can see that cell type 1 summates red mostly from its center and an opponent green mostly from its surround, while for type 2 the red and green are reversed. These cells can be termed single-opponency cells, as seen in primates (Derrington et al. 1984). One might object that the segregation of the red and green in the center is not very dramatic.
Actually, this is due to the simplified model we have taken. Complete segregation can be achieved if one allows the synaptic noise parameter N₀, which was set to 1 for the dashed curves, to be different for the two channels. In fact, a difference of 1/2 between the two noises produces the solutions shown by the solid curves in Figure 5.

Figure 5: The retinal kernel K^{ab} in the R and G basis predicted by the theory for r = 0.85 (primate regime) and for the same parameters used in Figure 2 (dashed curves). The solid curves use the same parameters with one exception: the parameter N₀ was allowed to differ between the luminance and chromatic channels by a factor of two. This was done to illustrate that complete color segregation in the cell's center can easily be achieved.
Acknowledgments

We wish to thank K. Miller for useful comments on the manuscript. Work supported in part by a grant from the Seaver Institute.
References

Atick, J. J., and Redlich, A. N. 1990. Towards a theory of early visual processing. Neural Comp. 2, 308-320.
Atick, J. J., and Redlich, A. N. 1992. What does the retina know about natural scenes? Neural Comp. 4, 196-210; also Quantitative tests of a theory of retinal processing: Contrast sensitivity curves. Preprint no. IASSNS-HEP-90/51.
Atick, J. J., Li, Z., and Redlich, A. N. 1990. Color coding and its interaction with spatiotemporal processing in the retina. Preprint no. IASSNS-HEP-90/75.
Attneave, F. 1954. Some informational aspects of visual perception. Psychol. Rev. 61, 183-193.
Barlow, H. B. 1961. Possible principles underlying the transformation of sensory messages. In Sensory Communication, W. A. Rosenblith, ed. MIT Press, Cambridge, MA.
Buchsbaum, G., and Gottschalk, A. 1983. Trichromacy, opponent colours coding and optimum colour information transmission in the retina. Proc. R. Soc. London Ser. B 220, 89-113.
Campbell, F. W., and Gubisch, R. W. 1966. Optical quality of the human eye. J. Physiol. 186, 558-578.
Daw, N. W. 1968. Colour-coded ganglion cells in the goldfish retina: Extension of their receptive fields by means of new stimuli. J. Physiol. 197, 567-592.
Derrington, A. M., Krauskopf, J., and Lennie, P. 1984. Chromatic mechanisms in lateral geniculate nucleus of macaque. J. Physiol. 357, 241-265.
Field, D. J. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4, 2379-2394.
Lythgoe, J. N. 1979. The Ecology of Vision. Oxford University Press, Oxford.
Mullen, K. T. 1985. The contrast sensitivity of human colour vision to red-green and blue-yellow chromatic gratings. J. Physiol. 359, 381-400.
Received 8 January 1991; accepted 3 January 1992.
Communicated by Ellen Hildreth
Interaction between Transparency and Structure from Motion Daniel Kersten Department of Psychology, University of Minnesota, Minneapolis, MN 55455 USA
Heinrich H. Bülthoff Bennett L. Schwartz Kenneth J. Kurtz Department of Cognitive and Linguistic Sciences, Brown University, Providence, RI 02912 USA
It is well known that the human visual system can reconstruct depth from simple random-dot displays given binocular disparity or motion information. This fact has lent support to the notion that stereo and structure from motion systems rely on low-level primitives derived from image intensities. In contrast, the judgment of surface transparency is often considered to be a higher-level visual process that, in addition to pictorial cues, utilizes stereo and motion information to separate the transparent from the opaque parts. We describe a new illusion and present psychophysical results that question this sequential view by showing that depth from transparency and opacity can override the bias to see rigid motion. The brain’s computation of transparency may involve a two-way interaction with the computation of structure from motion. 1 Introduction
One of the major challenges of vision research is to understand how the brain constructs a model of the visual environment from the pattern of changing retinal light intensities. With relatively few exceptions (Poggio et al. 1988; Barrow and Tenenbaum 1978), computational research has sought to first divide the problem into noninteracting modules such as surface color from radiance, shape from shading, or structure from motion (Land 1983; Horn and Brooks 1989; Ullman 1979; Martin and Aggarwal 1988). Consistent with the methodology of computer vision, current physiological and psychophysical research indicates modular and concurrent processing for some sources, such as motion, or form and color (Livingstone and Hubel 1987; Cavanagh 1987; Zeki 1978; Van Essen 1985).

Neural Computation 4, 573-589 (1992)

© 1992 Massachusetts Institute of Technology
In contrast to the modularity of vision research, it is phenomenally apparent that visual information is eventually integrated to provide a strikingly singular description of the visual environment. The simple unity of visual experience belies some difficult problems of neural computation. One problem is cue integration: the combination (possibly linear) of visual information from multiple sources, such as stereo and motion, to compute a single attribute such as depth. A second and theoretically more difficult problem is the cooperative coupling (typically nonlinear) of the perceptual representations of two or more scene attributes (such as surface depth and material property) to achieve the consistency required by the laws of image formation (Kersten 1991; Bülthoff 1991; Bülthoff and Yuille 1991). Because the outputs from two modules are not independent, algorithms require feedback between modules and thus are open to the problems of convergence and instability (Clark and Yuille 1990).¹ We describe a new illusion that has a bistable three-dimensional (3D) interpretation. The bistability is interpreted as the result of cooperative coupling between depth from motion and phenomenal surface transparency. Phenomenal transparency of a surface means we can see it and through it to another background surface. This illusion poses a problem for computational models of structure from motion because it seems to require cooperative coupling, or strong fusion, between representations of relative surface depth and surface material. Motion provides information about relative depth relationships between surfaces in the world. Information for depth is available from motion parallax and motion disparity (Wallach and O'Connell 1953). Theoretical work has shown how structure from motion can be reconstructed with a priori structural biases, such as assuming that the object viewed is rigid (Martin and Aggarwal 1988).² Interactions between depth from motion and other depth sources, such as stereo and proximity luminance, have been studied before (Dosher et al. 1986; Nawrot and Blake 1989). With respect to transparency, it has recently been discovered that the degree of transparency determines whether two superimposed and independently moving square-wave grating patterns at right angles to each other are seen as moving in a single direction or in two independent directions (Ramachandran 1989; Stoner et al. 1990). In these experiments, when the luminance of the intersection of the two gratings was consistent with that derived from a physically transparent grating, the motion of the two gratings was seen to be independent: they appeared to be

¹Our distinction between integration and cooperative coupling closely corresponds to the distinction between weak data fusion and strong fusion with recurrency made by Clark and Yuille (1990).
²Motion parallax typically refers to the differences in image speed, due to viewpoint changes, between points at different distances under perspective projection. Because the same retinal optic flow pattern can be induced by moving the object, the term motion parallax is sometimes used in this case too. Structure from motion typically refers to the reconstruction of object geometry due to its motion under orthographic or perspective projection.
sliding over each other. However, if the intersection luminance was not consistent with a physically plausible transparency or with occlusion, the gratings appeared to cohere as a single pattern moving in a unique direction. The conclusion was that motion-detecting mechanisms may have tacit knowledge of the physics of transparency. In these studies, the relationship between depth and transparency is only implicit. Although the transparency of a surface implies that it is closer than the surface it covers, one would like to know whether transparency can provide specific depth information that could affect three-dimensional structure from motion. It turns out that particular intensity relationships not only determine whether transparency is seen (Metelli 1974; Richards and Witkin 1979; Beck et al. 1984), but, as is shown below, also bias which of two overlapping surfaces is seen in front. We call this depth from transparency. How do depth from motion and transparency interact? In particular, when depth from motion and depth from transparency contradict, which takes precedence: motion or transparency information?

2 Perceptual Observations
In an attempt to answer the above questions, we simulated an object consisting of two square planar parallel surfaces that could rigidly rock back and forth at ±40° about a common vertical axis midway between them (see Fig. 1 for more details). The planes could be seen to be square in a head-on view, but typically appeared trapezoidal due to perspective. Either the top or bottom face could be made to appear in front of the other, depending on apparent transparency and depth from motion. The particular intensity relationships of the four regions bias the apparent transparency of a face, and thus determine the relative depth of the front and back planes. The motion parallax, together with a bias toward rigidity (Wallach et al. 1953), also affects the depth one sees. In the following we will describe the basic perceptual phenomena, and then detail the results of some quantitative psychophysical measurements. In all three of the demonstrations described below, the rigid motion is described as being consistent with the bottom face being in front of the top face, and only the intensities of the various regions are changed. As described in Section 3, the basic observations are unaffected by whether the top or bottom face is in front. First we looked at the case in which both surfaces have zero transparency; that is, they are both opaque, with the bottom square in front and partially occluding the top (Fig. 2a). When the object was rocked back and forth, not surprisingly, observers saw rigid motion that was consistent with both the motion and occlusion cues. The surfaces do share, and appear to share, a common rotation axis that is behind the bottom square and in front of the top square. Next the intensity of the center patch of overlap was changed to match the top face.
Figure 1: Animated sequences of images corresponding to a perspective view of two rigidly coupled planar faces (each a simulated 5 × 5 cm square) were generated with a Macintosh II computer and displayed on a CRT monitor with a 256 gray-level capacity. The object was rocked back and forth rigidly about the vertical axis passing between the two surfaces and through a point equidistant to both. Like the Necker cube, which is an orthographic projection of a wire cube, a particular image frame can give rise to an ambiguous depth percept: the top face can appear in front of or behind the bottom face. The planes oscillated sinusoidally over an amplitude of ±40° at 0.48 Hz. The front and back faces were separated by a simulated depth of 6 cm. The centers of the two faces were separated by 2.5 cm in the horizontal and vertical directions. The distance between the point equidistant between the two faces and the observer's eyepoint was 57 cm. There were 21 frames per period.
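The display geometry in the caption is straightforward to reproduce in outline. The sketch below is our reconstruction from the stated parameters, not the authors' code: it rocks two rigidly coupled 5 × 5 cm squares, separated by 6 cm in depth, about the common vertical axis through their midpoint, and projects the corners through a pinhole 57 cm away, which is all a frame-by-frame animation needs.

import numpy as np

EYE_DIST = 57.0     # cm, eye to the point midway between the faces
HALF = 2.5          # cm, half-side of each 5 x 5 cm square
DEPTH = 6.0         # cm, simulated separation between the faces
OFFSET = 2.5        # cm, horizontal and vertical offset between face centers
AMP, FRAMES = np.deg2rad(40.0), 21    # ±40 deg rocking, 21 frames per period

def frame_corners(k):
    # Perspectively projected corners of both faces for frame k of one period.
    theta = AMP * np.sin(2 * np.pi * k / FRAMES)     # sinusoidal rocking
    pts = []
    for cz, sign in ((-DEPTH / 2, -1), (DEPTH / 2, +1)):   # front, back face
        cx = cy = sign * OFFSET / 2
        for dx in (-HALF, HALF):
            for dy in (-HALF, HALF):
                x, y, z = cx + dx, cy + dy, cz
                # rigid rotation about the vertical axis through the midpoint
                xr = x * np.cos(theta) + z * np.sin(theta)
                zr = -x * np.sin(theta) + z * np.cos(theta)
                # pinhole projection; nearer points project larger
                s = EYE_DIST / (EYE_DIST + zr)
                pts.append((xr * s, y * s))
    return pts

print(frame_corners(0))   # head-on view: the front face projects slightly larger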
[Figure 2 panels: a. occlusion; b. light/contrast reduce; c. contrast reduce/dark; d. lighter/light; e. light/dark; f. dark/darker; g. dark/dark (top-bottom-equal).]
Figure 2: Different transparency types. The first and second words in the label for a transparency type indicate how the top and bottom faces affect the brightness of the patches they cover, respectively. For example, the dark/darker transparency means that both the top and bottom faces darkened what they cover, and that the bottom one was darker than the top. Note that in the actual animation, the nearer square appeared slightly larger because of the perspective projection. Note how the contrast-reducing faces (bottom square in b and top square in c) tend to appear nearer the observer.
Thus the top patch appeared to occlude the bottom, in contradiction to the rigid motion, which indicated that the bottom square was in front. Occlusion completely inhibited the rigid interpretation, and we saw the two faces apparently slipping and sliding over one another, appearing approximately in the same plane. The nature of the "slipping and sliding" appearance can be understood as follows. The perspective projection of a single moving square face unambiguously indicates whether the axis of rotation of the face is in front of or behind the face. The bottom square appeared to be rotating about an axis behind it and away from the observer. The top face appeared to be rotating about an axis between it and the observer. Because occlusion indicated the top face was closer to the observer than the bottom face, the two faces could not share the same axis of rotation, and therefore did not appear to be rigidly coupled. The nonrigid percept persists for many minutes. After a while, observers report that they can see the outside edges of the two surfaces move as if rigidly coupled if they consciously discount the two local T-junctions indicating occlusion.³ Of seven observers queried, all reported seeing weak but definite subjective contours that complete the occluded square behind the center overlapping patch. Interestingly, these faint contours are visible even when nonrigid motion is seen, as if the occluding patch were transparent. Next we relaxed the occlusion cue by adjusting the intensities of the patches so that one of the two faces appeared transparent. In one case, we adjusted the intensities so that either of the surfaces could appear to be a dark film lying over a light gray background, referred to below as a high-contrast "dark/darker" condition (see Table 1). In this condition, even when the surfaces are stationary, the depth relations are ambiguous and bistable, in that either the top or bottom surface may appear in front in a stationary view. A simple model of physical transparency is the multiplication and/or addition of two source images. A dark/darker transparency could arise if the transparency is created by multiplying two source images. In this case, one might expect bistability because multiplication is commutative, so there is no way to decide which of the source images represents a surface that is in front. It is curious to note that the plausible alternative of both surfaces being transparent is never reported, suggesting a default perceptual assumption that minimizes the number of transparent surfaces. One can also adjust the intensities of the top and bottom squares to be equal. In this case the only biases favoring seeing a plane in front are to prefer the bottom over the top, and the larger over the smaller (Fig. 2g). In either the dark/darker or the top-bottom-equal condition, when the two faces were rocked back and forth, we saw a striking bistability. With the bottom face in front, we saw both planes rigidly rocking back and forth with the bottom face appearing transparent and the top face opaque. After watching this for 2 to 30 sec, suddenly the top face would appear in front and then the

³A T-junction occurs where the edge of the occluder covers the edge of the occluded surface. An X-junction is the image point where transparent and opaque contours cross.
Table 1: Intensity values (cd/m²) for the transparency types.ᵃ

Transparency type                 Top patch  Center patch  Bottom patch  Background  Contrast (%)
Dark/dark with top-bottom-equal      26          16            26            51          -24
Occlusion                            38          16            16            51            0
Dark/darker (HC)                     38          16            26            51          -24
Contrast reduce/dark (HC)            38          26            16            51           24
Light/dark (HC)                      51          26            16            38           24
Light/contrast reduce (HC)           51          38            26            16           19
Lighter/light (HC)                   38          51            26            16           32
Dark/darker (LC)                     38          16            19            51          -8.6
Contrast reduce/dark (LC)            38          26            23            51           6.1
Light/dark (LC)                      38          26            23            38           6.1
Light/contrast reduce (LC)           51          38            34            16           5.6
Lighter/light (LC)                   38          51            46            16           5.2

ᵃIn addition to occlusion and a dark/dark top-bottom-equal transparency, the choice of transparency types was motivated by a consideration of the possible transparencies one can generate by permuting four intensities. There are twenty-four possible permutations, but these can be reduced to just six by excluding top/bottom symmetry and the physically implausible contrast-reversing and contrast-enhancing pairs. Of these six, two involve faces that both darkened the underlying surfaces, so one was eliminated, leaving five. To further increase the range of transparency types, we also added five stimuli in which the Michelson contrast of the lower right-hand corner of the central patch was smaller. The high-contrast (HC) and low-contrast (LC) groups had contrasts whose absolute values were above 19% and below 8.6%, respectively.
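The contrast column of Table 1 appears to be the Michelson contrast between the central (overlap) patch and the bottom face at their shared border, (I_center - I_bottom)/(I_center + I_bottom). The short verification script below, written by us from the table's intensity values, reproduces the printed percentages.

# Michelson contrast at the lower right corner of the central patch:
# the border between the central patch and the bottom face.
rows = {  # transparency type: (center, bottom) intensities in cd/m^2
    "dark/dark top-bottom-equal": (16, 26),
    "occlusion":                  (16, 16),
    "dark/darker (HC)":           (16, 26),
    "contrast reduce/dark (HC)":  (26, 16),
    "light/dark (HC)":            (26, 16),
    "light/contrast reduce (HC)": (38, 26),
    "lighter/light (HC)":         (51, 26),
    "dark/darker (LC)":           (16, 19),
    "contrast reduce/dark (LC)":  (26, 23),
    "light/dark (LC)":            (26, 23),
    "light/contrast reduce (LC)": (38, 34),
    "lighter/light (LC)":         (51, 46),
}
for name, (center, bottom) in rows.items():
    contrast = 100.0 * (center - bottom) / (center + bottom)
    print(f"{name:30s} {contrast:+6.1f}%")   # matches the Table 1 column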
perceived motion was one of two faces slipping and sliding over each other. Simultaneous with this reversal of depth, there was an exchange of surface property: the top face now appeared transparent and the bottom opaque. In the top-bottom-equal condition, there is no depth-from-transparency bias to see one or the other face in front. Nevertheless, at a given moment, the visual system makes a commitment to one of two plausible relative depth and transparency assignments. The specific default assignment of depth, which lasts for a while and then changes, is similar to what happens when viewing a stationary Necker cube. Our demonstration, in addition, clearly demonstrates that relative depth interacts with apparent surface transparency. In a third demonstration, we sought a condition intermediate between the symmetric transparency of a dark/dark combination and complete occlusion by constructing a transparent overlay that appears diaphanous. A diaphanous transparent square has both additive and multiplicative components that bias its relative depth to be in front of the other square. This can be physically realized by a finely perforated screen whose holes are below the spatial resolution limit and that transmits a fraction of the light coming from behind, and reflects a fraction coming from the front (Kersten 1991; Richards and Witkin 1979). Consistent with the interpretation of a perforated screen, a film that reduces the contrast of the edges
it overlays, by lightening the darker region and darkening the lighter without changing contrast polarity, tends to be seen in front (Fig. 2b and c; a numerical sketch of this screen model is given at the end of this section). In the demonstration, the top square was made to appear contrast reducing. The bottom square was made to appear as a dark patch behind the contrast-reducing top square (the high-contrast "contrast reduce/dark" condition in Table 1, Fig. 2c). When the two faces were rocked back and forth, we saw the wrong motion. Just as in the case of occlusion, the surfaces appeared to slip nonrigidly over one another, with the top face appearing in front. After several seconds of observation, suddenly rigid motion is seen, at which time the top contrast-reducing square is seen behind a dark bottom film. Again there was a simultaneous and unambiguous reversal of apparent transparency: the contrast-reducing top square suddenly appeared opaque and behind a dark film at the bottom. Our paradigm is potentially useful for studying other interactions between structure from motion and transparency. For example, informal observations have shown that connecting the corners of the two faces with lines (e.g., a wire-frame outline of the cube) can override the nonrigid interpretation induced by weak perspective information. Further, even without connecting wires, an orthographic projection eliminates the nonrigid interpretation. Under orthographic projection, the appearance of the two faces flips between two rigid interpretations, like a Necker cube. In this case, the transparency biases which of the two rigid interpretations is seen.
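The perforated-screen account of a contrast-reducing film can be written down directly: the screen transmits a fraction t of the light from behind and adds a reflected component a from the light in front, so the seen intensity is the affine map I → t·I + a. The sketch below is a minimal model with assumed values of t and a, not the authors' stimulus code; it shows that such a film lightens the darker region, darkens the lighter one, and compresses the Michelson contrast of a covered edge without reversing its polarity.

def through_screen(I, t=0.5, a=10.0):
    # Intensity seen through a perforated screen: transmitted fraction t of
    # the background light plus an additive reflected component a (cd/m^2).
    return t * I + a

def michelson(i1, i2):
    return (i1 - i2) / (i1 + i2)

dark, light = 16.0, 51.0   # an edge behind the screen
before = michelson(light, dark)
after = michelson(through_screen(light), through_screen(dark))
print(before, after)   # same sign (polarity preserved), smaller magnitude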
3 Psychophysics: Response Time vs. Face-in-Front Bias

To quantify the interaction between transparency cues on depth and structure from motion, we measured the reaction time to see rigid motion, conditional on the perceived depth relations seen in an initial static view. The time to see rigid motion was measured in two basic conditions, in which the initial depth perception, based on transparency, could either conflict (inconsistent condition) or agree (consistent condition) with the subsequent 3D motion. The experimental set-up was as before. By specifying the gray levels of the four image regions, it was possible to vary the transparency, and thus produce a range of biases for whether a face of a particular transparency type appeared in front. This range of biases can be seen on the abscissa of Figure 3, where the proportion of times a particular face (e.g., the darker of a dark/darker combination) is seen in front (the face-in-front bias) ranges from 50 to 100% of the time, depending on the transparency type, for a given observer. We chose 12 different transparency types, summarized in Table 1. The notation for the transparency type indicates how the top and bottom patches affect the brightness of the background. On half of the trials, the top face
was in front of the bottom face (front-top), as defined by the subsequent motion, and on the other half of the trials it was behind the bottom face (front-bottom). Because the perspective view made the image of the front patch larger than the back, we balanced for a possible bias by showing the observers the stimuli with the top and bottom intensities "normal" or "exchanged" for each of the front-top and front-bottom conditions. Because we could not guarantee that a given transparency condition would
generate a consistent depth ordering, the observer was asked to indicate whether the top or bottom surface appeared in front by pushing a button. This button press also initiated the animation of the object. The subject was to push another button once rigid motion was seen. The time to see rigid motion was measured. There were five subjects. One was an inexperienced psychophysical observer, and four were experienced psychophysical observers who were naive as to the purposes of the experiment. Each subject saw each stimulus eight times. The presentation order was randomized. Because the basic pattern of results was similar across the five subjects, we show in Figure 3 the results from just two: the inexperienced observer (LC) and a second observer (DCK).

Figure 3: Mean time (±SEM) to see rigid motion plotted against the face-in-front bias for two observers (A and B). The face-in-front bias is the proportion of times a particular face appeared in front in the initial static view (e.g., the contrast-reducing face tended to appear in front of whatever face it overlapped). Results from the 12 different transparency conditions are shown. Each point is the mean of 16 measurements, averaged over conditions in which the top and bottom intensities were exchanged.

The reaction times were substantially longer when the transparency cues gave depth relations inconsistent with the subsequent rigid motion. Both observers in Figure 3 show the positive correlation between the reaction time (inconsistent condition) and the bias to see a particular face consistently in front in the initial static view. The increase in reaction time for the inconsistent condition and the positive correlation were characteristic of all five observers. There was also a sizeable difference in the range of observers' reaction times: between 0.5 and 3 sec for one observer (DCK), and between 1 and 30 sec for the second, naive observer (LC). Apart from occlusion and contrast-reducing transparency, there was no obvious rule to predict the face-in-front bias, which varied with both transparency type and observer. In all the contrast-reducing conditions, the contrast-reducing face was more likely to be seen in front than not. For example, in the contrast reduce/dark (HC) and the light/contrast reduce (LC) transparencies, the contrast-reducing face was seen in front an average of 98 and 99% of the time, respectively. These figures were averaged over 16 presentations each for all five observers. Apart from occlusion, the agreement between observers was less consistent for the other conditions. This property of contrast-reducing faces can be understood in terms of the physics of transparency. For more details, see Kersten and Bülthoff (1991).

4 Discussion
Psychophysical evidence has been presented elsewhere that occlusion information may be represented early in the visual system (Ramachandran and Cavanagh 1985; Nakayama et al. 1989; Shimojo et al. 1989). Our results are consistent with the idea that the determination of which regions the boundary of a surface belongs to is done early. We add to this that the attachment of an edge to a region is influenced by transparency. Further, this attachment occurs early enough to affect the perceived depth between two moving surfaces. Other recent studies have indicated that response times to see correct stereo depth, as well as stereo disparity thresholds, are also influenced by perceived transparency (Kersten et al.
1989; Kersten 1991; Trueswell and Hayhoe 1991; see also Nakayama et al. 1989). What are the implications of our psychophysical observations for neural computation? The perceptual observations presented above suggest criteria that a model for the perception of surface transparency must eventually satisfy. Specifically, the model should handle (1) resolution of multiple locally ambiguous image data, (2) integration of depth from transparency and depth from motion, (3) multistable perception, and (4) cooperative computation of surface material and relative depth. Although there are models that incorporate some of these constraints, we know of no model that satisfies all of these criteria. Ultimately, a good neural model should also make some predictions concerning the neurophysiology.

4.1 Resolution of Multiple Locally Ambiguous Image Data. It seems unlikely that any neural model that works by propagating numerous local constraints could account for our psychophysical data, and the perception of transparency in general. The basic fact to contend with is that the local ambiguity at the two distantly separated X-junctions always seems to be resolved by the human visual system. For example, at a given instant, the relative depth of an edge in each X-junction of the top-bottom-equal stimulus (Fig. 2g) has a 50% chance of being in front of (or behind) the other edge it crosses. Any model that propagates local constraints along a contour, such as a Markov random field model for transparency (Kersten 1991), has a 50% chance that the two X-junctions are inconsistent; that is, the vertical edge of one square is transparent, while the horizontal edge of the same square is opaque. This corresponds to a perception that is never seen. Iterative methods using various convergence tricks can avoid this problem and solve simple transparencies (Kersten 1991). However, any method that requires a large number of iterations to converge is an improbable algorithm for a perception that is fast enough to interact with structure from motion. A possible solution is to compute a relatively small number of intermediate surface representations, and then choose the most probable configurations.

4.2 Integration of Depth Cues. Part of the problem of the interaction between depth from motion and transparency is a cue integration problem (Terzopoulos 1986; Nawrot and Blake 1991; Bülthoff and Mallot 1988; Maloney and Landy 1989). In Bülthoff and Mallot's studies of depth integration, depth from shading and stereo was shown to accumulate, gradually increasing the perceived curvature of a smooth convex surface when the cues were consistent; but, as in the present study, inconsistent cues were not resolved by averaging. One could imagine an accumulation of depth from transparency: a gradual increase in the contrast reduction of a planar surface mixing with the depth from motion to produce an intermediate relative depth. But this does not happen. The perceived depth
is fixed until suddenly it flips. This is analogous to the flip in the perception of the rotation of a clear transparent sphere that is covered with dots and viewed under orthographic projection. The rotation ambiguity is resolved by stereo (Nawrot and Blake 1989). Nawrot and Blake have designed a network model of integration in which the sources of depth information (binocular and motion disparity) are at the same or corresponding image locations (Nawrot and Blake 1991). However, in our transparent stimuli, the transparency cues are at the X-junctions, and the depth from motion (and perspective) involves the movement of the X-junctions as well as the distant corners. There is no local motion or static intensity information at an X-junction to disambiguate the relative depth. Again, this argues for an intermediate explicit surface representation in a model of cooperativity involving material and depth attributes.

4.3 Multistability. One way of viewing multistability is in terms of the brain constructing an a posteriori probability of the world's state of affairs conditional on the image data (Kersten 1991). Multistability is reflected in multiple discrete modes of the probability distribution. These modes can also be identified with local minima in an energy function. But computing even a single mode can be problematic because of the local minima and convergence problems faced with nonlinear coupled networks. It is not too difficult to construct a posterior distribution or energy function that has the right modes, but it is difficult to construct one that has only the few correct modes. Further, this Bayesian formulation does not answer the mystery of how the switch is made from one mode to the next. A number of the properties of perceptual multistability have been paralleled in simulated neural networks and algorithms (Kawamoto and Anderson 1985; Ditzinger and Haken 1989; Nawrot and Blake 1991). These models employ oscillating nonlinear dynamical systems, adaptation in unit activity level, or synaptic unlearning (e.g., via anti-Hebbian modification). However, exactly what happens in the brain during the flip between opacity and transparency is a puzzle for the future.

4.4 Cooperative Computation of Surface Material and Relative Depth. Our main point in this paper is that the striking bistability of the perceived motion, together with the perceptual interchange of transparency and depth, strongly suggests that surface transparency and relative depth are explicitly represented in the brain, and that they are computed cooperatively. The need for computing multiple interacting representations in vision has been pointed out by Barrow and Tenenbaum (1978), studied in the "Vision Machine" project at M.I.T. (Poggio et al. 1990), and supported psychophysically by studies in the perception of lightness (Gilchrist 1977). There have been only a few attempts at relating cooperative computation of scene attributes to psychophysics (Kersten and Plummer 1988; Adelson et al. 1989; Kersten 1991).
Both the Bayesian approach and our results are consistent with the general notion of the visual system picking rational and plausible interpretations of the scene properties causing the image. A good example of a rational interpretation of scene attributes is given by Blake and Bülthoff (1990). They showed that a binocularly viewed curved surface is only perceived as glossy if the specular highlight is close to the correct distance (according to ray optics and differential geometry) from the surface. Their work is also an excellent example of the disambiguation power of cue integration: the convex/concave ambiguity of shape-from-shading can be disambiguated by information about the position of specular highlights (Blake and Bülthoff 1990, 1991). Jepson and Richards (1991) propose a logical framework for cooperative computation between modules. The idea is to choose the interpretation of the world that jointly satisfies the premises about scene structure (e.g., rigidity) from the various vision modules, while avoiding unnecessary faulting of these premises. This is related to the work of Clark and Yuille (1990), who give a thorough discussion of the problems of data fusion between modules within a probabilistic framework.
4.5 Neurophysiology. Neurons in V2 of monkey visual cortex have been shown to respond to subjective contours, suggesting that these cells may have an important role in detecting surface occlusion (von der Heydt et al. 1984; von der Heydt and Peterhans 1989; Nakayama and Shimojo 1990). These V2 neurons are hypothesized to detect conjunctions in the outputs of spatially separate end-stopped neurons with the same preferred orientation (Peterhans et al. 1986). To signal objective contours, the model neuron also receives additive input from a complex cell whose orientation is perpendicular to the preferred orientation of the end-stopped cells. In light of the above arguments for explicit representation of transparent surfaces, it might be reasonable to look for neural responses to edges that depend on the monkey's perception of what surface region belongs to the edge. For example, some neurons (perhaps as early as V2) may be responsive to an edge constituting a transparency X-junction only when that edge belongs to a surface that appears in front of a second surface. There might be a bias to respond to the contrast-reducing edge of an X-junction because it is likely to be in front of the background surface. These responses, if found, would presumably be influenced by the global structure of the surface (e.g., X-junctions outside of the classic receptive field). Responses to a local X-junction produced by a pair of moving transparent surfaces (as in Fig. 1) may also be expected to depend on changes in perceived depth induced by structure from motion. This is certainly possible, given the existence of feedback connections from MT to V2. MT is an area implicated in global motion processing (Allman et al. 1985; Movshon et al. 1985). One could also look for neural responses contingent on surface depth from motion. It has been shown that MT may play a role in the perception of transparent surface motion in random dot displays (Snowden et al. 1991).
Eventually the brain's computation of visual surfaces must incorporate information regarding the relations between surfaces. One way to do this would be to assume that the output of the above hypothetical V2 subjective contour unit combines with units signaling relative depth information. Alternatively, one could abandon the strictly bottom-up subjective contour model and incorporate feedback involving global surface information, such as that obtained from a motion module. In Section 2, we described the presence of faint subjective contours that completed the occluded square during motion. This observation is also suggestive of a possible role of feedback from a motion area, such as MT, to V2 in the process of surface computation.

Acknowledgments

This work was supported by NSF Grants BNS-8708532 and BNS-9109514 to Daniel Kersten, a grant from the Graduate School of the University of Minnesota, and AFOSR-90-0274. This research began at the Center for Biological Information Processing at M.I.T., which is supported in part by the Office of Naval Research, Cognitive and Neural Sciences Division, Grant N00014-88-K-0164, and in part by the National Science Foundation, Grant IRI-8719394. The authors thank Momi Furuya for data collected in pilot studies, David Knill and Irving Biederman for useful comments and suggestions, and Edward Adelson, who first showed us the striking effect contrast reduction can have on depth ordering of transparencies at the 1989 meeting on Computational Models of Visual Processing at Cold Spring Harbor Laboratory.

References

Adelson, E. H., Pentland, A., and Juo, J. 1989. The extraction of shading and reflectance. Invest. Ophthalmol. Visual Sci. Suppl. 30(3), 262.
Allman, J., Miezin, F., and McGuinness, E. 1985. Direction- and velocity-specific responses from beyond the classical receptive field in the middle temporal visual area (MT). Perception 14, 105-126.
Barrow, H. G., and Tenenbaum, J. M. 1978. Recovering intrinsic scene characteristics from images. In Computer Vision Systems, A. R. Hanson and E. M. Riseman, eds., pp. 3-26. Academic Press, New York.
Beck, J., Prazdny, K., and Ivry, R. 1984. The perception of transparency with achromatic colors. Percept. Psychophys. 35(4), 407-422.
Blake, A., and Bülthoff, H. H. 1990. Does the brain know the physics of specular reflection? Nature (London) 343, 165-169.
Blake, A., and Bülthoff, H. H. 1991. Shape from specularities: Computation and psychophysics. Phil. Transact. R. Soc. (London) Ser. B 331, 237-252.
Bülthoff, H. H. 1991. Shape from X: Stereo, shading, texture, specularity. In Computational Models of Visual Processing. MIT Press, Cambridge, MA.
Bülthoff, H. H., and Mallot, H. A. 1988. Integration of depth modules: Stereo and shading. J. Opt. Soc. Am. A 5(10), 1749-1758.
Bülthoff, H. H., and Yuille, A. 1991. Bayesian models for seeing surfaces and depth. Comments Theor. Biol. 2(4), 283-314.
Cavanagh, P. 1987. Reconstructing the third dimension: Interactions between color, texture, motion, binocular disparity and shape. Comput. Vision Graphics Image Process. 37, 171-195.
Clark, J. J., and Yuille, A. L. 1990. Data Fusion for Sensory Information Processing. Kluwer Academic Publishers, Boston.
Ditzinger, T., and Haken, H. 1989. Oscillations in the perception of ambiguous patterns. Biol. Cybernet. 61, 279-287.
Dosher, B. A., Sperling, G., and Wurst, S. A. 1986. Tradeoffs between stereopsis and proximity luminance covariance as determinants of perceived 3D structure. Vision Res. 26(6), 973-990.
Gilchrist, A. L. 1977. Perceived lightness depends on perceived spatial arrangement. Science 195, 185-187.
Horn, B. K. P., and Brooks, M. J. 1989. Shape from Shading. MIT Press, Cambridge, MA.
Jepson, A., and Richards, W. 1991. Lattice framework for integrating vision modules. To appear in June issue of IEEE Syst. Man Cybern.
Kawamoto, A. H., and Anderson, J. A. 1985. A neural network model of multistable perception. Acta Psychol. 59, 35-65.
Kersten, D. 1991. Transparency and the cooperative computation of scene attributes. In Computational Models of Visual Processing, M. Landy and A. Movshon, eds., pp. 209-228. MIT Press, Cambridge, MA.
Kersten, D., and Bülthoff, H. H. 1991. Transparency affects structure from motion. Tech. Rep. 34, Center for Biological Information Processing, Massachusetts Institute of Technology.
Kersten, D., and Plummer, D. J. 1988. Reflectance estimation in the presence of sharp shadows or transparency. International Neural Network Society, Boston.
Kersten, D., Bülthoff, H. H., and Furuya, M. 1989. Apparent opacity affects perception of structure from motion and stereo. Invest. Ophthalmol. Visual Sci. Suppl. 30(3), 264.
Land, E. H. 1983. Recent advances in retinex theory and some implications for cortical computations: Color vision and the natural image. Proc. Natl. Acad. Sci. U.S.A. 80, 5163-5169.
Livingstone, M. S., and Hubel, D. H. 1987. Psychophysical evidence for separate channels for the perception of form, color, movement and depth. J. Neurosci. 7(11), 3416-3468.
Maloney, L. T., and Landy, M. S. 1989. A statistical framework for robust fusion of depth information. SPIE Visual Commun. Image Process. 1199, 1154-1163.
Martin, W. N., and Aggarwal, J. K. 1988. Motion Understanding: Robot and Human Vision. Kluwer Academic Publishers, Boston.
Metelli, F. 1974. The perception of transparency. Sci. Am. 230(4), 91-98.
Movshon, J. A., Adelson, E. H., Gizzi, M. S., and Newsome, W. T. 1986. The analysis of moving visual patterns. In Pattern Recognition Mechanisms, C. Chagas, R. Gattas, and C. G. Gross, eds., pp. 117-151. Vatican Press, Rome, Italy.
Nakayama, K., and Shimojo, S. 1990. Toward a neural understanding of visual surface representation. Brain: Proc. Cold Spring Harbor Symp. Quant. Biol. 55, 911-924.
Nakayama, K., Shimojo, S., and Ramachandran, V. S. 1989. Depth, subjective contours and transparency: Relation to neon color spreading. Invest. Ophthalmol. Visual Sci. Suppl. 30(3), 255.
Nawrot, M., and Blake, R. 1989. Neural integration of information specifying structure from stereopsis and motion. Science 244, 716-718.
Nawrot, M., and Blake, R. 1991. A neural network model of kinetic depth. Visual Neurosci. 6, 219-227.
Peterhans, E., von der Heydt, R., and Baumgartner, G. 1986. Neuronal responses to illusory contour stimuli reveal stages of visual cortical processing. In Visual Neuroscience, J. Pettigrew, K. J. Sanderson, and W. R. Levick, eds., pp. 343-351. Cambridge University Press, Cambridge, England.
Poggio, T., Gamble, E. B., and Little, J. J. 1988. Parallel integration of vision modules. Science 242, 436-440.
Poggio, T., Little, J., Gamble, E., Gillett, W., Geiger, D., Weinshall, D., Villalba, M., Larson, N., Cass, T., Bülthoff, H., Drumheller, M., Oppenheimer, P., Yang, W., and Hurlbert, A. 1990. The M.I.T. Vision Machine. In Artificial Intelligence at MIT: Expanding Frontiers, Volume 2, P. H. Winston and S. A. Shellard, eds., pp. 492-529. MIT Press, Cambridge, MA.
Ramachandran, V. 1989. Constraints imposed by occlusion and image segmentation. Vision and Three-dimensional Representation. University of Minnesota, Minneapolis, MN.
Ramachandran, V. S., and Cavanagh, P. 1985. Subjective contours capture stereopsis. Nature (London) 317, 527-530.
Richards, W., and Witkin, A. P. 1979. Efficient Computations and Representations of Visible Surfaces. Tech. Rep. AFOSR-79-0020, Massachusetts Institute of Technology.
Shimojo, S., Silverman, G. H., and Nakayama, K. 1989. Occlusion and the solution to the aperture problem for motion. Vision Res. 29(5), 619-626.
Snowden, R. J., Treue, S., Erickson, R. G., and Andersen, R. A. 1991. The response of area MT and V1 neurons to transparent motion. J. Neurosci. 11(9), 2768-2785.
Stoner, G. R., Albright, T. D., and Ramachandran, V. S. 1990. Transparency and coherence in human motion perception. Nature (London) 344, 155-157.
Terzopoulos, D. 1986. Integrating visual information from multiple sources. In From Pixels to Predicates, A. Pentland, ed., pp. 111-142. Ablex Publishing Corporation, Norwood, NJ.
Trueswell, J. C., and Hayhoe, M. 1991. Transparency interacts with binocular disparity to determine perceived depth. Invest. Ophthalmol. Visual Sci. Suppl. 32(4), 694.
Ullman, S. 1979. The interpretation of structure from motion. Proc. R. Soc. London Ser. B 203, 405-426.
Van Essen, D. C. 1985. Functional organization of primate visual cortex. Cerebral Cortex 3, 259-329.
von der Heydt, R., and Peterhans, E. 1989. Mechanisms of contour perception in monkey visual cortex. I. Lines of pattern discontinuity. J. Neurosci. 9, 1731-1748.
von der Heydt, R., Peterhans, E., and Baumgartner, G. 1984. Illusory contours and cortical neuron responses. Science 224, 1260-1262.
Wallach, H., and O'Connell, D. N. 1953. The kinetic depth effect. J. Exp. Psychol. 45, 205-217.
Zeki, S. 1978. Functional specialization in the visual cortex of the Rhesus monkey. Nature (London) 274, 423-428.
Received 1 July 1991; accepted 17 January 1992.
Communicated by Terrence J. Sejnowski
Information-Based Objective Functions for Active Data Selection

David J. C. MacKay*
Computation and Neural Systems, California Institute of Technology 139-74, Pasadena, CA 91125 USA
Learning can be made more efficient if we can actively select particularly salient data points. Within a Bayesian learning framework, objective functions are discussed that measure the expected informativeness of candidate measurements. Three alternative specifications of what we want to gain information about lead to three different criteria for data selection. All these criteria depend on the assumption that the hypothesis space is correct, which may prove to be their main weakness. 1 Introduction
Theories for data modeling often assume that the data are provided by a source that we do not control. However, there are two scenarios in which we are able to actively select training data. In the first, data measurements are relatively expensive or slow, and we want to know where to look next so as to learn as much as possible. According to Jaynes (1986), Bayesian reasoning was first applied to this problem two centuries ago by Laplace, who in consequence made more important discoveries in celestial mechanics than anyone else. In the second scenario, there is an immense amount of data and we wish to select a subset of data points that is most useful for our purposes. Both these scenarios will benefit if we have ways of objectively estimating the utility of candidate data points. The problem of "active learning" or "sequential design" has been extensively studied in economic theory and statistics (El-Gamal 1991; Fedorov 1972). Experimental design within a Bayesian framework using the Shannon information as an objective function has been studied by Lindley (1956) and by Luttrell (1985). A distinctive feature of this approach is that it renders the optimization of the experimental design independent of the "tests" that are to be applied to the data and the loss functions associated with any decisions. This paper uses similar information-based

*Present address: Cavendish Laboratory, Madingley Road, Cambridge, CB3 0HE, United Kingdom. E-mail: [email protected].
Neural Computation 4, 590-604 (1992) © 1992 Massachusetts Institute of Technology
objective functions and discusses the problem of optimal data selection within the Bayesian framework for interpolation described in previous papers (MacKay 1992a,b). Most of the results in this paper have direct analogs in Fedorov (1972), though the quantities involved have different interpretations: for example, Fedorov's dispersion of an estimator becomes the Bayesian's posterior variance of the parameter. This work was directly stimulated by a presentation given by John Skilling at Maxent 91 (Skilling 1992). Recent work in the neural networks literature on active data selection, also known as "query learning," has concentrated on slightly different problems: the work of Baum (1991) and Hwang et al. (1991) relates to perfectly separable classification problems only; in both these papers a sensible query-based learning algorithm is proposed, and empirical results of the algorithm are reported; Baum also gives a convergence proof. But since the algorithms are both human designed, it is not clear what objective function their querying strategy optimizes, nor how the algorithms could be improved. In contrast, this paper (which discusses noisy interpolation problems) derives criteria from defined objective functions; each objective function leads to a different data selection criterion. A future paper will discuss the application of the same ideas to classification problems (MacKay 1992c). Plutowski and White (1991) study a different problem from the above, in the context of noise-free interpolation: they assume that a large amount of data has already been gathered, and work on principles for selecting a subset of that data for efficient training; the entire data set (inputs and targets) is consulted at each iteration to decide which example to add to the training subset, an option that is not permitted in this paper.

1.1 Statement of the Problem. Imagine that we are gathering data in the form of a set of input-output pairs $D_N = \{\mathbf{x}^{(m)}, t^{(m)}\}$, where $m = 1 \ldots N$. These data are modeled with an interpolant $y(\mathbf{x}; \mathbf{w}, \mathcal{A})$. An interpolation model $\mathcal{H}$ specifies the "architecture" $\mathcal{A}$, which defines the functional dependence of the interpolant on the parameters $w_i$, $i = 1 \ldots k$. The model also specifies a regularizer, or prior on $\mathbf{w}$, and a cost function, or noise model $\mathcal{N}$, describing the expected relationship between $y$ and $t$. We may have more than one interpolation model, which may be linear or nonlinear in $\mathbf{w}$. Two previous papers (MacKay 1992a,b) described the Bayesian framework for fitting and comparing such models, assuming a fixed data set. This paper discusses how the same framework for interpolation relates to the task of selecting what data to gather next. Our criterion for how informative a new datum is will depend on what we are interested in. Several alternatives spring to mind:
1. If we have decided to use one particular interpolation model, we might wish to select new data points to be maximally informative about the values that that model’s parameters w should take.
2. Alternatively, we might not be interested in getting a globally well-determined interpolant; we might only want to be able to predict the value of the interpolant accurately in a limited region, perhaps at a point in input space that we are not able to sample directly. 3. Lastly, we might be unsure which of two or more models is the best interpolation model, and we might want to select data so as to give us maximal information to discriminate between the models.
This paper will study each of these tasks for the case in which we wish to evaluate the utility as a function of $\mathbf{x}^{N+1}$, the input location at which a single measurement of a scalar $t^{N+1}$ will be made. The more complex task of selecting multiple new data points will not be addressed here, but the methods used can be generalized to solve this task, as is discussed in Fedorov (1972) and Luttrell (1985). The similar problem of choosing the $\mathbf{x}^{N+1}$ at which a vector of outputs $\mathbf{t}^{N+1}$ is measured will not be addressed either. The first and third definitions of information gain have both been studied in the abstract by Lindley (1956). All three cases have been studied by Fedorov (1972), mainly in non-Bayesian terms. In this paper, solutions will be obtained for the interpolation problem by using a gaussian approximation and in some cases assuming that the new datum is a relatively weak piece of information. In common with most other work on active learning, the utility is evaluated assuming that the probability distributions defined by the interpolation model are correct. For some models, this assumption may be the Achilles' heel of this approach, as discussed in Section 6.

1.2 Can Our Choice Bias Our Inferences? One might speculate that the way we choose to gather data might be able to bias our inferences systematically away from the truth. If this were the case we might need to make our inferences in a way that undoes such biases by taking into account how we gathered the data. In orthodox statistics many estimators and statistical tests do depend on the sampling strategy. However, the likelihood principle states that our inferences should depend on the likelihood of the actual data received, not on other data that we might have gathered but did not. Bayesian inference is consistent with this principle; there is no need to undo biases introduced by the data collecting strategy, because it is not possible for such biases to be introduced, as long as we perform inference using all the data gathered (Berger 1985; Loredo 1989). When the models are concerned with estimating the distribution of output variables $t$ given input variables $\mathbf{x}$, we are allowed to look at the $\mathbf{x}$ value of a datum, and decide whether or not to include the datum in the data set. This will not bias our inferences about the distribution $P(t \mid \mathbf{x})$.
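Before turning to the choice of information measure, it may help to make the setting of Section 1.1 concrete. The following sketch is not from the paper: the linear-in-parameters model, the gaussian-bump basis, and every numerical value (centers, widths, alpha, beta, N) are assumptions made purely for illustration; the matrix it builds anticipates the Hessian A that Section 3.1 defines.

```python
# A minimal sketch of the interpolation setting, assuming a model that is
# linear in its parameters, y(x; w) = sum_i w_i * phi_i(x).  Everything
# numerical here is an illustrative assumption, not a value from the paper.
import numpy as np

rng = np.random.default_rng(0)

centers = np.linspace(-2.0, 4.0, 10)   # k = 10 gaussian-bump basis functions
alpha = 0.1                            # regularizing constant (weight decay)
beta = 1.0 / 0.05**2                   # inverse noise variance

def phi(x):
    # For a linear model, the sensitivity vector g(x) = dy/dw is the basis.
    return np.exp(-0.5 * ((x - centers) / 0.5) ** 2)

# N = 21 input-output pairs: a hypothetical smooth function plus noise
X = rng.uniform(-2.0, 4.0, size=21)
t = np.sin(2.0 * X) + rng.normal(0.0, 0.05, size=X.shape)

# Posterior precision (the Hessian A of Section 3.1) and its inverse
Phi = np.array([phi(x) for x in X])
A = alpha * np.eye(len(centers)) + beta * Phi.T @ Phi
A_inv = np.linalg.inv(A)
w_MP = beta * A_inv @ Phi.T @ t        # most probable parameters
```

The later sketches reuse this pattern of building $A^{-1}$ from the basis and the input locations.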
2 Choice of Information Measure
Before we can start, we need to select a measure of the information gained about an unknown variable when we receive the new datum $t^{N+1}$. Having chosen such a measure we will then select the $\mathbf{x}^{N+1}$ for which the expected information gain is maximal. Two measures of information have been suggested, both based on Shannon's entropy, whose properties as a sensible information measure are well known. Let us explore this choice for the first task, where we want to gain maximal information about the parameters of the interpolant, $\mathbf{w}$. Let the probability distributions of the parameters before and after we receive the datum $t^{N+1}$ be $P^N(\mathbf{w})$ and $P^{N+1}(\mathbf{w})$. Then the change in entropy of the distribution is $\Delta S = S_N - S_{N+1}$, where

$$S_N = -\int d^k\mathbf{w}\, P^N(\mathbf{w}) \log \frac{P^N(\mathbf{w})}{m(\mathbf{w})} \quad (2.1)$$

where $m$ is the measure on $\mathbf{w}$ that makes the argument of the log dimensionless.² The greater $\Delta S$ is, the more information we have gained about $\mathbf{w}$. In the case of the quadratic models discussed in MacKay (1992a), if we set the measure $m(\mathbf{w})$ equal to the prior, the quantity $S_N$ is closely related to the log of the "Occam factor."³

An alternative information measure is the cross entropy between $P^N(\mathbf{w})$ and $P^{N+1}(\mathbf{w})$:

$$G = \int d^k\mathbf{w}\, P^{N+1}(\mathbf{w}) \log \frac{P^N(\mathbf{w})}{P^{N+1}(\mathbf{w})} \quad (2.2)$$

Let us define $G' = -G$ so as to obtain a positive quantity; then $G'$ is a measure of how much information we gain when we are informed that the true distribution of $\mathbf{w}$ is $P^{N+1}(\mathbf{w})$, rather than $P^N(\mathbf{w})$. These two information measures are not equal. Intuitively they differ in that if the measure $m(\mathbf{w})$ is flat, $\Delta S$ only quantifies how much the probability "bubble" of $P(\mathbf{w})$ shrinks when the new datum arrives; $G'$ also incorporates a measure of how much the bubble moves because of the new datum. Thus according to $G'$, even if the probability distribution does not shrink and become more certain, we have learned something if the distribution moves from one region to another in $\mathbf{w}$-space. The question of which information measure is appropriate is potentially complicated by the fact that $G'$ is not a consistent additive measure of information: if we receive datum A then datum B, in general, $G'_{AB} \neq G'_A + G'_B$. This intriguing complication will not, however, hinder our task: we can only base our decisions on the expectations of $\Delta S$

²This measure $m$ will be unimportant in what follows but is included to avoid committing dimensional crimes. Note that the sign of $\Delta S$ has been defined so that our information gain corresponds to positive $\Delta S$.
³If the Occam factor is $\text{O.F.} = (2\pi)^{k/2} \det^{-1/2} A\, \exp(-\alpha E_W^{MP})/Z_W(\alpha)$, then $S_N = \log \text{O.F.} + \gamma/2$, using notation from MacKay (1992a).
and $G'$; we will now see that in expectation $\Delta S$ and $G'$ are equal, so for our purposes there is no distinction between them. This result holds independent of the details of the models we study and independent of any gaussian approximation for $P(\mathbf{w})$.

Proof. $E(\Delta S) = E(G')$. To evaluate the expectation of these quantities, we have to assume a probability distribution from which the datum $t^{N+1}$ (henceforth abbreviated as $t$) comes. We will define this probability distribution by assuming that our current model, complete with its error bars, is correct. This means that the probability distribution of $t$ is $P(t \mid D_N, \mathcal{H})$, where $\mathcal{H}$ is the total specification of our model. The conditioning variables on the right will be omitted in the following proof. We can now compare the expectations of $\Delta S$ and $G'$:

$$E(G') = \int dt\, P(t) \int d^k\mathbf{w}\, P(\mathbf{w} \mid t) \log \frac{P(\mathbf{w} \mid t)}{m(\mathbf{w})} + \int dt\, P(t) \int d^k\mathbf{w}\, P(\mathbf{w} \mid t) \log \frac{m(\mathbf{w})}{P(\mathbf{w})} \quad (2.3)$$

where $m$ is free to be any measure on $\mathbf{w}$; let us make it the same measure $m$ as in equation 2.1. Then the first term in equation 2.3 is $-E(S_{N+1})$. In the second term, exchanging the order of integration and using $\int dt\, P(t) P(\mathbf{w} \mid t) = P(\mathbf{w})$ leaves $\int d^k\mathbf{w}\, P(\mathbf{w}) \log [m(\mathbf{w})/P(\mathbf{w})] = S_N$. So

$$E(G') = E(-S_{N+1} + S_N) = E(\Delta S)$$

Thus the two candidate information measures are equivalent for our purposes. This proof also implicitly demonstrates that $E(\Delta S)$ is independent of the measure $m(\mathbf{w})$. Other properties of $E(\Delta S)$ are proved in Lindley (1956). The rest of this paper will use $\Delta S$ as the information measure, with $m(\mathbf{w})$ set to a constant.
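This equivalence is easy to check numerically. The sketch below uses a one-parameter conjugate gaussian model, an assumption made here only so that both $\Delta S$ and $G'$ have closed forms; none of this code or its numbers comes from the paper.

```python
# Monte Carlo check that E(dS) = E(G') for a one-parameter gaussian model:
# prior w ~ N(mu_N, s2_N), datum t | w ~ N(w, s2_nu).  Values illustrative.
import numpy as np

rng = np.random.default_rng(1)
mu_N, s2_N, s2_nu = 0.3, 1.0, 0.25

# The posterior variance is independent of t, so dS is the same for every t.
s2_N1 = 1.0 / (1.0 / s2_N + 1.0 / s2_nu)
dS = 0.5 * np.log(s2_N / s2_N1)

# G' = KL(P_{N+1} || P_N) depends on t through the posterior mean shift.
t = rng.normal(mu_N, np.sqrt(s2_N + s2_nu), size=200_000)  # predictive draws
mu_N1 = s2_N1 * (mu_N / s2_N + t / s2_nu)
G_prime = 0.5 * (np.log(s2_N / s2_N1) + s2_N1 / s2_N
                 + (mu_N1 - mu_N) ** 2 / s2_N - 1.0)

print(dS, G_prime.mean())   # the two agree up to Monte Carlo error
```

Here $\Delta S$ is deterministic while $G'$ fluctuates with $t$; only their expectations coincide, which is exactly what the proof asserts.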
3 Maximizing Total Information Gain

Let us now solve the first task: how to choose $\mathbf{x}^{N+1}$ so that the expected information gain about $\mathbf{w}$ is maximized. Intuitively we expect that we will learn most about the interpolant by gathering data at the $\mathbf{x}$ location where our error bars on the interpolant are currently greatest. Within the quadratic approximation, we will now confirm that intuition.

3.1 Notation. The likelihood of the data is defined in terms of a noise level $\sigma_\nu^2 = \beta^{-1}$ by $P(\{t\} \mid \mathbf{w}, \beta, \mathcal{N}) = \exp[-\beta E_D(\mathbf{w})]/Z_D$, where $E_D(\mathbf{w}) = \sum_m \frac{1}{2}[t^{(m)} - y(\mathbf{x}^{(m)}; \mathbf{w})]^2$, and $Z_D$ is the appropriate normalizing constant. The likelihood could also be defined with an $\mathbf{x}$-dependent noise level $\beta^{-1}(\mathbf{x})$, or correlated noise in multiple outputs (in which case $\beta^{-1}$ would
be the covariance matrix of the noise). From here on $\mathbf{y}$ will be treated as a scalar $y$ for simplicity. When the likelihood for the first $N$ data is combined with a prior $P(\mathbf{w} \mid \alpha, \mathcal{R}) = \exp[-\alpha E_W(\mathbf{w})]/Z_W$, in which the regularizing constant (or weight decay rate) $\alpha$ corresponds to the prior expected smoothness of the interpolant, we obtain our current probability distribution for $\mathbf{w}$, $P^N(\mathbf{w}) = \exp[-M(\mathbf{w})]/Z_M$, where $M(\mathbf{w}) = \alpha E_W + \beta E_D$. The objective function $M(\mathbf{w})$ can be quadratically approximated near to the most probable parameter vector, $\mathbf{w}_{MP}$, by

$$M(\mathbf{w}) \simeq M^*(\mathbf{w}) = M(\mathbf{w}_{MP}) + \frac{1}{2}\Delta\mathbf{w}^T A\, \Delta\mathbf{w} \quad (3.1)$$

where $\Delta\mathbf{w} = \mathbf{w} - \mathbf{w}_{MP}$ and the Hessian $A = \nabla\nabla M$ is evaluated at the minimum $\mathbf{w}_{MP}$. We will use this quadratic approximation from here on. If $M$ has other minima, those can be treated as distinct models as in MacKay (1992b). First we will need to know what the entropy of a gaussian distribution is. It is easy to confirm that if $P(\mathbf{w}) \propto e^{-M^*(\mathbf{w})}$, then for a flat measure $m(\mathbf{w}) = m$,

$$S = \frac{k}{2}(1 + \log 2\pi) + \frac{1}{2}\log(m^2 \det A^{-1}) \quad (3.2)$$

Thus our aim in minimizing $S$ is to make the size of the joint error bars on the parameters, $\det A^{-1}$, as small as possible. Expanding $y$ around $\mathbf{w}_{MP}$, let

$$y(\mathbf{x}; \mathbf{w}) \simeq y(\mathbf{x}; \mathbf{w}_{MP}) + \mathbf{g}^T \Delta\mathbf{w} \quad (3.3)$$
where $g_j = \partial y/\partial w_j$ is the ($\mathbf{x}$-dependent) sensitivity of the output variable to parameter $w_j$, evaluated at $\mathbf{w}_{MP}$. Now imagine that we choose a particular input $\mathbf{x}$ and collect a new datum. If the datum $t$ falls in the region such that our quadratic approximation applies, the new Hessian $A_{N+1}$ is

$$A_{N+1} \simeq A + \beta \mathbf{g}\mathbf{g}^T \quad (3.4)$$

where we have used the approximation $\nabla\nabla \frac{1}{2}[t - y(\mathbf{x}; \mathbf{w})]^2 \simeq \mathbf{g}\mathbf{g}^T$. This expression neglects terms in $\partial^2 y/\partial w_j \partial w_k$; those terms are exactly zero for the linear models discussed in MacKay (1992a), but they are not necessarily negligible for nonlinear models such as neural networks. Notice that this new Hessian is independent of the value that the datum $t$ actually takes, so we can specify what the information gain $\Delta S$ will be for any datum, because we can evaluate $A_{N+1}$ just by calculating $\mathbf{g}$. Let us now see what property of a datum causes it to be maximally informative. The new entropy $S_{N+1}$ is equal to $\frac{1}{2}\log(m^2 \det A_{N+1}^{-1})$,
neglecting additive constants. This determinant can be analytically evaluated (Fedorov 1972), using the identities

$$[A + \beta\mathbf{g}\mathbf{g}^T]^{-1} = A^{-1} - \frac{\beta A^{-1}\mathbf{g}\mathbf{g}^T A^{-1}}{1 + \beta\mathbf{g}^T A^{-1}\mathbf{g}} \quad \text{and} \quad \det[A + \beta\mathbf{g}\mathbf{g}^T] = (\det A)(1 + \beta\mathbf{g}^T A^{-1}\mathbf{g}) \quad (3.5)$$

from which we obtain:

$$\text{Total information gain} = \frac{1}{2}\Delta\log(m^2 \det A) = \frac{1}{2}\log(1 + \beta\,\mathbf{g}^T A^{-1}\mathbf{g}) \quad (3.6)$$
In the product $\beta\mathbf{g}^T A^{-1}\mathbf{g}$, the first term tells us that, not surprisingly, we learn more information if we make a low noise (high $\beta$) measurement. The second term $\mathbf{g}^T A^{-1}\mathbf{g}$ is precisely the variance of the interpolant at the point where the datum is collected. Thus we have our first result: to obtain maximal information about the interpolant, take the next datum at the point where the error bars on the interpolant are currently largest (assuming the noise $\sigma_\nu$ on all measurements is the same). This rule is the same as that resulting from the "D-optimal" and "minimax" design criteria (Fedorov 1972). For many interpolation models, the error bars are largest beyond the most extreme points where data have been gathered. This first criterion would in those cases lead us to repeatedly gather data at the edges of the input space, which might be considered non-ideal behavior; but we do not necessarily need to introduce an ad hoc procedure to avoid this. The reason we do not want repeated sampling at the edges is that we do not want to know what happens there. Accordingly, we can derive criteria from alternative objective functions which only value information acquired about the interpolant in a defined region of interest.
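The first criterion can be put to work with a simple grid search; the following sketch uses the same illustrative assumptions as the earlier snippet (gaussian-bump basis, invented hyperparameters), none of which come from the paper.

```python
# Sketch of equation 3.6: take the next datum where the expected total
# information gain 0.5 * log(1 + beta * g^T A^{-1} g) is largest, i.e.,
# where the error bars on the interpolant are currently widest.
import numpy as np

rng = np.random.default_rng(0)
centers = np.linspace(-2.0, 4.0, 10)
alpha, beta = 0.1, 1.0 / 0.05**2
phi = lambda x: np.exp(-0.5 * ((x - centers) / 0.5) ** 2)

X = rng.uniform(-2.0, 4.0, size=21)            # existing input locations
Phi = np.array([phi(x) for x in X])
A_inv = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)

candidates = np.linspace(-3.0, 5.0, 401)
gain = np.array([0.5 * np.log1p(beta * phi(x) @ A_inv @ phi(x))
                 for x in candidates])
print(candidates[gain.argmax()])               # tends to fall at the edges
                                               # of, or in gaps between, data
```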
4 Maximizing Information about the Interpolant in a Region of Interest

Thus we come to the second task. First assume we wish to gain maximal information about the value of the interpolant at a particular point $\mathbf{x}^{(u)}$. Under the quadratic approximation, our uncertainty about the interpolant $y$ has a gaussian distribution, and the size of the error bars is given in terms of the Hessian of the parameters by

$$\sigma^2_{y^{(u)}} = \mathbf{g}_{(u)}^T A^{-1} \mathbf{g}_{(u)}$$

where $\mathbf{g}_{(u)}$ is $\partial y/\partial\mathbf{w}$ evaluated at $\mathbf{x}^{(u)}$. As above, the entropy of this gaussian distribution is $\frac{1}{2}\log\sigma^2_{y^{(u)}} + \text{const}$. After a measurement $t$ is
made at $\mathbf{x}$ where the sensitivity is $\mathbf{g}$, these error bars are scaled down by a factor of $1 - \rho^2$, where $\rho$ is the correlation between the variables $t$ and $y^{(u)}$, given by $\rho^2 = [\mathbf{g}^T A^{-1}\mathbf{g}_{(u)}]^2 / [\sigma^2_{y^{(u)}}(\sigma_y^2 + \sigma_\nu^2)]$, where $\sigma_y^2 = \mathbf{g}^T A^{-1}\mathbf{g}$. Thus the information gain about $y^{(u)}$ is

$$\text{Marginal information gain} = -\frac{1}{2}\Delta\log\sigma^2_{y^{(u)}} = -\frac{1}{2}\log(1 - \rho^2) \quad (4.1)$$
The term $\mathbf{g}^T A^{-1}\mathbf{g}_{(u)}$ is maximized when the sensitivities $\mathbf{g}$ and $\mathbf{g}_{(u)}$ are maximally correlated, as measured by their inner product in the metric defined by $A^{-1}$. The second task is thus solved for the case of extrapolation to a single point. This objective function is demonstrated and criticized in Section 6.

4.1 Generalization to Multiple Points. Now imagine that the objective function is defined to be the information gained about the interpolant at a set of points $\{\mathbf{x}^{(u)}\}$. These points should be thought of as representatives of the region of interest, for example, points in a test set. This case also includes the generalization to more than one output variable $y$; however, the full generalization, to optimization of an experiment in which many measurements are made, will not be made here (see Fedorov 1972 and Luttrell 1985). The preceding objective function, the information about $y^{(u)}$, can be generalized in several ways, some of which lead to dissatisfactory results.

4.1.1 First Objective Function for Multiple Points. An obvious objective function is the joint entropy of the output variables that we are interested in. Let the set of output variables for which we want to minimize the uncertainty be $\{y^{(u)}\}$, where $u = 1 \ldots V$ runs either over a sequence of different input locations $\mathbf{x}^{(u)}$, or over a set of different scalar outputs, or both. Let the sensitivities of these outputs to the parameters be $\mathbf{g}_{(u)}$. Then the covariance matrix of the values $\{y^{(u)}\}$ is

$$Y = G^T A^{-1} G \quad (4.2)$$
where the matrix $G = [\mathbf{g}_{(1)}\mathbf{g}_{(2)} \ldots \mathbf{g}_{(V)}]$. Disregarding the possibility that $Y$ might not have full rank, which would necessitate a more complex treatment giving similar results, the joint entropy of our output variables $S[P(\{y^{(u)}\})]$ is related to $\log\det Y^{-1}$. We can find the information gain for a measurement with sensitivity vector $\mathbf{g}$, under which $A \to A + \beta\mathbf{g}\mathbf{g}^T$, using the identities (equation 3.5):

$$\text{Joint information gain} = \frac{1}{2}\Delta\log\det Y^{-1} = -\frac{1}{2}\log\left[1 - \frac{(\mathbf{g}^T A^{-1} G)\, Y^{-1}\, (G^T A^{-1}\mathbf{g})}{\sigma_\nu^2 + \sigma_y^2}\right] \quad (4.3)$$
The row vector $\mathbf{v} = \mathbf{g}^T A^{-1} G$ measures the correlations between the sensitivities $\mathbf{g}$ and $\mathbf{g}_{(u)}$. The quadratic form $\mathbf{v} Y^{-1} \mathbf{v}^T$ measures how effectively these correlations work together to reduce the joint uncertainty in $\{y^{(u)}\}$. The denominator $\sigma_\nu^2 + \sigma_y^2$ moderates this term in favor of measurements with small uncertainty.
4.1.2 Criticism. I will now argue that actually the joint entropy $S[P(\{y^{(u)}\})]$ of the interpolant's values is not an appropriate objective function. A simple example will illustrate this. Imagine that $V = k$, that is, the number of points defining our region of interest is the same as the dimensionality of the parameter space $\mathbf{w}$. The resulting matrix $G = [\mathbf{g}_{(1)}\mathbf{g}_{(2)} \ldots \mathbf{g}_{(V)}]$ may be almost singular if the points $\mathbf{x}^{(u)}$ are close together, but typically it will still have full rank. Then the parameter vector $\mathbf{w}$ and the values of the interpolant $\{y^{(u)}\}$ are in one-to-one (locally) linear correspondence with each other. This means that the change in entropy of $P(\{y^{(u)}\})$ is identical to the change in entropy of $P(\mathbf{w})$ (Lindley 1956). This can be confirmed by substitution of $Y^{-1} = G^{-1} A G^{-1T}$ into equation 4.3, which yields equation 3.6. So if the datum is chosen in accordance with equation 4.3, so as to maximize the expected joint information gain about $\{y^{(u)}\}$, exactly the same choice will result as is obtained maximizing the first criterion, the expected total information gain about $\mathbf{w}$ (Section 3.1)! Clearly, this choice is independent of our choice of $\{y^{(u)}\}$, so it will have nothing to do with our region of interest. This criticism of the joint entropy is not restricted to the case $V = k$. The reason that this objective function does not achieve what we want is that the joint entropy is decreased by measurements that introduce correlations among predictions about $\{y^{(u)}\}$ as well as by measurements that reduce the individual uncertainties of predictions. However, we do not want the variables $\{y^{(u)}\}$ to be strongly correlated in some arbitrary way; rather we want each $y^{(u)}$ to have small variance, so that if we are subsequently asked to predict the value of $y$ at any one of the $\mathbf{x}^{(u)}$, we will be able to make confident predictions.
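The $V = k$ argument can be verified directly; the sketch below invents a positive definite "Hessian" and a full-rank $G$ (all numbers are made up for the check) and confirms that equations 4.3 and 3.6 then give identical answers for any measurement.

```python
# Numerical check: with V = k and full-rank G, the joint information gain
# (equation 4.3) equals the total information gain (equation 3.6).
# All matrices and numbers here are invented for the check.
import numpy as np

rng = np.random.default_rng(2)
k, beta = 6, 400.0
M = rng.normal(size=(k, k))
A = k * np.eye(k) + 0.1 * (M + M.T)    # symmetric positive definite "Hessian"
A_inv = np.linalg.inv(A)

G = rng.normal(size=(k, k))            # V = k sensitivity vectors g_(u)
Y_inv = np.linalg.inv(G.T @ A_inv @ G)

g = rng.normal(size=k)                 # candidate measurement sensitivity
s2_y, s2_nu = g @ A_inv @ g, 1.0 / beta
v = g @ A_inv @ G                      # correlations with the g_(u)

joint = -0.5 * np.log1p(-(v @ Y_inv @ v) / (s2_nu + s2_y))
total = 0.5 * np.log1p(beta * s2_y)
print(joint, total)                    # identical up to round-off
```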
4.1.3 Second Objective Function for Multiple Points. This motivates an alternative objective function: to maximize the average over $u$ of the information gained about $y^{(u)}$ alone. Let us define the mean marginal entropy,

$$S_M = \sum_u P_u\, S[P(y^{(u)})] = \frac{1}{2}\sum_u P_u \log\sigma^2_{y^{(u)}} + \text{const}$$

where $P_u$ is the probability that we will be asked to predict $y^{(u)}$, and
$\sigma^2_{y^{(u)}} = \mathbf{g}_{(u)}^T A^{-1}\mathbf{g}_{(u)}$. For a measurement with sensitivity vector $\mathbf{g}$, we obtain from equation 4.1:

$$\text{Mean marginal information gain} = -\frac{1}{2}\sum_u P_u \log\left[1 - \frac{[\mathbf{g}^T A^{-1}\mathbf{g}_{(u)}]^2}{\sigma^2_{y^{(u)}}(\sigma_y^2 + \sigma_\nu^2)}\right] \quad (4.4)$$
The mean marginal information gain is demonstrated and criticized in Section 6. Two simple variations on this objective function can be derived. If instead of minimizing the mean marginal entropy of our predictions $y^{(u)}$, we minimize the mean marginal entropy of the predicted noisy variables $t^{(u)}$, which are modeled as deviating from $y^{(u)}$ under additive noise of variance $\sigma_\nu^2$, we obtain equation 4.4 with $\sigma^2_{y^{(u)}}$ replaced by $\sigma^2_{y^{(u)}} + \sigma_\nu^2$. This alternative may lead to significantly different choices from equation 4.4 when any of the marginal variances $\sigma^2_{y^{(u)}}$ fall below the intrinsic variance $\sigma_\nu^2$ of the predicted variable. If instead we take an approach based on loss functions, and require that the datum we choose minimizes the expectation of the mean squared error of our predictions $\{y^{(u)}\}$, which is $E_M = \sum_u P_u \sigma^2_{y^{(u)}}$, then we obtain as our objective function, to leading order, $\Delta E_M \simeq \sum_u P_u (\mathbf{g}^T A^{-1}\mathbf{g}_{(u)})^2/(\sigma_\nu^2 + \sigma_y^2)$; this increases the bias in favor of reducing the variance of the variables $y^{(u)}$ with largest $\sigma^2_{y^{(u)}}$. This is the same as the "Q-optimal" design (Fedorov 1972).
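Equation 4.4 is straightforward to evaluate. The following sketch uses a uniform $P_u$ over equally spaced points of interest, together with the same illustrative basis and hyperparameters assumed in the earlier snippets; none of the numbers come from the paper.

```python
# Sketch of equation 4.4: mean marginal information gain over a region of
# interest (uniform P_u).  Basis and all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
centers = np.linspace(-2.0, 4.0, 10)
alpha, beta = 0.1, 1.0 / 0.05**2
phi = lambda x: np.exp(-0.5 * ((x - centers) / 0.5) ** 2)

X = rng.uniform(-2.0, 4.0, size=21)
Phi = np.array([phi(x) for x in X])
A_inv = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)

region = np.linspace(-2.1, 4.1, 300)             # points of interest x^(u)
Gu = np.array([phi(u) for u in region])          # rows are g_(u)
s2_u = np.einsum('ui,ij,uj->u', Gu, A_inv, Gu)   # sigma^2 of each y^(u)

def mean_marginal_gain(x):
    g = phi(x)
    s2_y = g @ A_inv @ g
    rho2 = (Gu @ A_inv @ g) ** 2 / (s2_u * (s2_y + 1.0 / beta))
    return -0.5 * np.mean(np.log1p(-rho2))       # uniform P_u = 1/V

gains = [mean_marginal_gain(x) for x in np.linspace(-4.0, 6.0, 201)]
# the expected utility decays to zero away from the region of interest
```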
4.2 Comment on the Case of Linear Models. It is interesting to note that for a linear model [one for which $y(\mathbf{x}; \mathbf{w}) = \sum_h w_h \phi_h(\mathbf{x})$] with quadratic penalty functions, the solutions to the first and second tasks depend only on the $\mathbf{x}$ locations where data were previously gathered, not on the actual data gathered $\{t\}$; this is because $\mathbf{g}(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x})$ independent of $\mathbf{w}$, so $A = \alpha\nabla\nabla E_W + \beta\sum_m \mathbf{g}\mathbf{g}^T$ is independent of $\{t\}$. A complete data-gathering plan can be drawn up before we start. It is only for a nonlinear model that our decisions about what data to gather next are affected by our previous observations!
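In code this observation is immediate: the Hessian, and hence every criterion above, is computable from the planned input locations alone. A minimal sketch, under the same illustrative assumptions as before:

```python
# Sketch of Section 4.2: for a linear model with a quadratic weight-decay
# regularizer, A = alpha*I + beta * sum_m phi(x_m) phi(x_m)^T uses only the
# input locations x_m, so a complete data-gathering plan can be drawn up
# before a single target t is observed.  Numbers are illustrative.
import numpy as np

centers = np.linspace(-2.0, 4.0, 10)
alpha, beta = 0.1, 400.0
phi = lambda x: np.exp(-0.5 * ((x - centers) / 0.5) ** 2)

X_planned = np.linspace(-2.0, 4.0, 21)   # only x locations are needed
Phi = np.array([phi(x) for x in X_planned])
A = alpha * np.eye(len(centers)) + beta * Phi.T @ Phi   # no t anywhere
A_inv = np.linalg.inv(A)                 # every criterion above follows
```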
5 Maximizing the Discrimination between Two Models

Under the quadratic approximation, two models will make slightly different gaussian predictions about the value of any datum. If we measure a datum $t$ at input value $\mathbf{x}$, then
$$P(t \mid \mathcal{H}_i) = \text{Normal}(\mu_i, \sigma_i^2)$$

where the parameters $\mu_i, \sigma_i^2$ are obtained for each interpolation model $\mathcal{H}_i$ from its own best fit parameters $\mathbf{w}_{MP}^{(i)}$, its own Hessian $A_i$, and its own sensitivity vector $\mathbf{g}_i$: $\mu_i = y(\mathbf{x}; \mathbf{w}_{MP}^{(i)})$ and $\sigma_i^2 = \sigma_\nu^2 + \mathbf{g}_i^T A_i^{-1}\mathbf{g}_i$.
Intuitively, we expect that the most informative measurement will be at a value of $\mathbf{x}$ such that $\mu_1$ and $\mu_2$ are as separated as possible from each other on a scale defined by $\sigma_1, \sigma_2$. Further thought will also confirm that we expect to gain more information if $\sigma_1^2$ and $\sigma_2^2$ differ from each other significantly; at such points, the "Occam factor" penalizing the more powerful model becomes more significant. Let us define the information gain to be $\Delta S = S_N - S_{N+1}$, where $S = -\sum_i P(\mathcal{H}_i)\log P(\mathcal{H}_i)$. Exact calculations of $\Delta S$ are not analytically possible, so I will assume that we are in the regime of small information gain, that is, we expect measurement of $t$ to give us a rather weak likelihood ratio $P(t \mid \mathcal{H}_1)/P(t \mid \mathcal{H}_2)$. This is the regime where $|\mu_1 - \mu_2| \ll \sigma_1, \sigma_2$. Using this assumption we can take the expectation over $t$, and a page of algebra leads to the result:
$$E(\Delta S) \simeq \frac{P(\mathcal{H}_1)P(\mathcal{H}_2)}{2}\left[\left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right)(\mu_1 - \mu_2)^2 + \left(\frac{\sigma_1^2 - \sigma_2^2}{\sigma_1\sigma_2}\right)^2\right] \quad (5.1)$$
These two terms correspond precisely to the two expectations stated above. The first term favors measurements where $\mu_1$ and $\mu_2$ are well separated; the second term favors places where $\sigma_1^2$ and $\sigma_2^2$ differ. Thus the third task has been solved. Fedorov (1972) makes a similar derivation but he uses a poor approximation that loses the second term.
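Equation 5.1 is cheap to scan over candidate inputs once each model's predictive mean and variance are available. In the sketch below the two predictive curves are invented for illustration; only the formula itself comes from the text.

```python
# Sketch of equation 5.1: expected information gain for discriminating two
# models from a measurement at x, valid in the weak-information regime
# |mu1 - mu2| << sigma_1, sigma_2.
import numpy as np

def expected_discrimination_gain(mu1, s21, mu2, s22, p1=0.5, p2=0.5):
    # E(dS) ~ (p1*p2/2) [ (1/s21 + 1/s22)(mu1-mu2)^2 + (s21-s22)^2/(s21*s22) ]
    return 0.5 * p1 * p2 * ((1.0 / s21 + 1.0 / s22) * (mu1 - mu2) ** 2
                            + (s21 - s22) ** 2 / (s21 * s22))

# Two hypothetical predictive means and variances along a grid of x
x = np.linspace(-2.0, 4.0, 200)
mu1, s21 = np.sin(2.0 * x), 0.05 + 0.01 * x ** 2
mu2, s22 = np.sin(2.0 * x) + 0.02 * (x - 1.0) ** 2, 0.05 + 0.02 * np.abs(x)

gain = expected_discrimination_gain(mu1, s21, mu2, s22)
print(x[gain.argmax()])   # sample where the predictions differ most
```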
6 Demonstration and Discussion

A data set consisting of 21 points from a one-dimensional interpolation problem was interpolated with an eight-hidden-unit neural network. The data were generated from a smooth function by adding noise with standard deviation $\sigma_\nu = 0.05$. The neural network was adapted to the data using weight decay terms $\alpha_c$, which were controlled using the methods of MacKay (1992b), and noise level $\beta$ fixed to $1/\sigma_\nu^2$. The data and the resulting interpolant, with error bars, are shown in Figure 1a. The expected total information gain, that is, the change in entropy of the parameters, is shown as a function of $x$ in Figure 1b. This is just a monotonic function of the size of the error bars. The same figure also shows the expected marginal information gain about three points of interest, $\{x^{(u)}\} = \{-1.25, 0.0, 1.75\}$. Notice that the marginal information gain is in each case peaked near the point of interest, as we would expect. Note also that the height of this peak is greatest for $x^{(u)} = -1.25$, where the interpolant oscillates rapidly, and lower for $x^{(u)} = 1.75$, where the interpolant is smoother. At each $x = x^{(u)}$, the marginal information gain about $x^{(u)}$ and the total information gain are equal. Figure 1c shows the mean marginal information gain, where the points of interest, $\{x^{(u)}\}$, were defined to be a set of equally spaced points on the interval $[-2.1, 4.1]$ (the same interval in which the training data lie).
Figure 1: Demonstration of total and marginal information gain. (a) The data set, the interpolant, and error bars. (b) The expected total information gain and three marginal information gains. (c) The mean marginal information gain, with the region of interest defined by 300 equally spaced points on the interval $[-2.1, 4.1]$. The information gains are shown on a scale of nats (1 nat $= \log_2 e$ bits).
The mean marginal information gain gradually decreases to zero away from the region of interest, as hoped. In the region to the left where the characteristic period of the interpolant is similar to the data spacing, the expected utility oscillates as $x$ passes through the existing data points, which also seems reasonable. The only surprising feature is that the estimated utility in that region is lower on the data points than the estimated utility in the smooth region toward the right.

6.1 The Achilles' Heel of These Methods. This approach has a potential weakness: there may be models for which, even though we have defined the region of interest by the points $\{x^{(u)}\}$, the expected marginal information gain for a measurement at $x$ still blows up as $x \to \pm\infty$, like the error bars. This can occur because the information gain estimates the utility of a data point assuming that the model is correct; if we know that the model is actually an approximation tool that is incorrect, then it is possible that undesirable behavior will result. A simple example that illustrates this problem is obtained if we consider modeling data with a straight line $y = w_1 x$, where $w_1$ is the unknown parameter. Imagine that we want to select data so as to obtain a model that predicts accurately at $x^{(u)}$. Then if we assume that the model is right, clearly we gain most information if we sample at the largest possible $|x|$, since such points give the largest signal-to-noise ratio for determining $w_1$. If, however, we assume that the model is actually not correct, but only an approximation tool, then common sense tells us we should sample closer to $x^{(u)}$. Thus if we are using models that we know are incorrect, the marginal information gain is really the right answer to the wrong question. It is a task for further research to formulate a new question whose answer is appropriate for any approximation model. Meanwhile, the mean marginal information gain seems a promising objective function to test further.

6.2 Computational Complexity. The computation of the suggested objective functions is moderately cheap once the inverse Hessian $A^{-1}$ has been obtained for the models concerned. This is a $O(Nk^2) + O(k^3)$ process, where $N$ is the number of data points and $k$ is the number of parameters; this process may already have been performed in order to evaluate error bars for the models, to evaluate the "evidence," to evaluate parameter "saliencies," and to enable efficient learning. This cost can be compared with the cost of locating a minimum of the objective function $M$, which in the worst case scales as $O(Nk^3)$ (taking the result for a quadratic function). Evaluation of the mean marginal information gain at $C$ candidate points $x$ then requires $O(Ck^2) + O(CVk)$ time, where $V$ is the number of points of interest $x^{(u)}$ [$O(k^2)$ to evaluate $A^{-1}\mathbf{g}$ for each $x$, and $O(Vk)$ to evaluate the dot product of this vector with each $\mathbf{g}_{(u)}$]. So if $C = O(k)$ and $V = O(k)$, evaluation of the mean marginal information
gain will be less computationally expensive than the inverse Hessian evaluation. For contexts in which this is too expensive, work in progress is exploring the possibility of reducing these calculations to $O(k^2)$ or smaller time by statistical methods. The question of how to efficiently search for the most informative $x$ is not addressed here; gradient-based methods could be constructed, but Figure 1c shows that the information gain is locally nonconvex, on a scale defined by the interdatum spacing.
7 Conclusion

For three specifications of the information to be maximized, a solution has been obtained. The solutions apply to linear and nonlinear interpolation models, but depend on the validity of a local gaussian approximation. Each solution has an analog in the non-Bayesian literature (Fedorov 1972), and generalizations to multiple measurements and multiple output variables can be found there, and also in Luttrell (1985). In each case a function of $x$ has been derived that predicts the information gain for a measurement at that $x$. This function can be used to search for an optimal value of $x$ (which in large-dimensional input spaces may not be a trivial task). This function could also serve as a way of reducing the size of a large data set by omitting the data points that are expected to be least informative. And this function could form the basis of a stopping rule, that is, a rule for deciding whether to gather more data, given a desired exchange rate of information gain per measurement (Lindley 1956). A possible weakness of these information-based approaches is that they estimate the utility of a measurement assuming that the model is correct. This might lead to undesirable results. The search for ideal measures of data utility is still open.

Acknowledgments
I thank Allen Knutsen, Tom Loredo, Marcus Mitchell, and the referees for helpful feedback. This work was supported by a Caltech Fellowship and a Studentship from SERC, UK.

References

Baum, E. B. 1991. Neural net algorithms that learn in polynomial time from examples and queries. IEEE Trans. Neural Networks 2(1), 5-19.
Berger, J. 1985. Statistical Decision Theory and Bayesian Analysis. Springer, New York.
El-Gamal, M. A. 1991. The role of priors in active Bayesian learning in the sequential statistical decision framework. In Maximum Entropy and Bayesian Methods, W. T. Grandy, Jr. and L. H. Schick, eds., pp. 33-38. Kluwer, Dordrecht.
Fedorov, V. V. 1972. Theory of Optimal Experiments. Academic Press, New York.
Hwang, J.-N., Choi, J. J., Oh, S., and Marks, R. J. II. 1991. Query-based learning applied to partially trained multilayer perceptrons. IEEE Trans. Neural Networks 2(1), 131-136.
Jaynes, E. T. 1986. Bayesian methods: General background. In Maximum Entropy and Bayesian Methods in Applied Statistics, J. H. Justice, ed., pp. 1-25. Cambridge University Press, Cambridge.
Lindley, D. V. 1956. On a measure of the information provided by an experiment. Ann. Math. Statist. 27, 986-1005.
Loredo, T. J. 1989. From Laplace to supernova SN 1987A: Bayesian inference in astrophysics. In Maximum Entropy and Bayesian Methods, P. Fougere, ed., pp. 81-142. Kluwer, Dordrecht.
Luttrell, S. P. 1985. The use of transinformation in the design of data sampling schemes for inverse problems. Inverse Prob. 1, 199-218.
MacKay, D. J. C. 1992a. Bayesian interpolation. Neural Comp. 4, 415-447.
MacKay, D. J. C. 1992b. A practical Bayesian framework for backprop networks. Neural Comp. 4, 448-472.
MacKay, D. J. C. 1992c. The evidence framework applied to classification networks. Neural Comp., in press.
Plutowski, M., and White, H. 1991. Active selection of training examples for network learning in noiseless environments. Dept. Computer Science, UCSD, TR 90-011.
Skilling, J. 1992. Bayesian solution of ordinary differential equations. In Maximum Entropy and Bayesian Methods, Seattle 1991, G. J. Erickson and C. R. Smith, eds. Kluwer, Dordrecht.
Received 17 July 1991; accepted 15 November 1991.
Communicated by Haim Sompolinsky
Four Types of Learning Curves Shun-ichi Amari Naotake Fujita Department of Mathematical Engineering and Information Physics, University of Tokyo, Tokyo 113, Japan
Shigeru Shinomoto Department of Physics, Kyoto University, Kyoto 606, Japan
If machines are learning to make decisions given a number of examples, the generalization error E ( t) is defined as the average probability that an incorrect decision is made for a new example by a machine when trained with t examples. The generalization error decreases as t increases, and the curve E ( t ) is called a learning curve. The present paper uses the Bayesian approach to show that given the annealed approximation, learning curves can be classified into four asymptotic types. If the machine is deterministic with noiseless teacher signals, then (1)E atp1 when the correct machine parameter is unique, and (2) E N a t p 2when the set of the correct parameters has a finite measure. If the teacher signals are noisy, then (3) E at-'/2 for a deterministic machine, and (4) E c + at-' for a stochastic machine. N
N
N
1 Introduction
A number of approaches have been proposed for machine learning. A classical example is the perceptron algorithm proposed by Rosenblatt (1961) for which a convergence theorem was given. A general theory of parametric learning was proposed by Amari (1967), Rumelhart et al. (19861, White (1989), and others, based on the stochastic gradient descent algorithm. See for example, Amari (1990) for a review of mathematical theory of neurocomputing. A new framework of PAC learning was proposed by Valiant (1984), in which both the computational complexity and stochastic evaluation of performance are taken into account. The theory was successfully applied to neural networks by Baum and Haussler (1989), where the VC dimension of a dichotomy class plays an important role. However, the framework is too restrictive, and Haussler et al. (1988) studied the general convergence rate of a learning curve by removing the algorithmic complexity constraint, while Baum (1990) has attempted to remove the worst case constraint on the probability distribution. Neural Computation 4, 605-618 (1992) @ 1992 Massachusetts Institute of TechnoIogy
606
S. Amari, N. Fujita, and S. Shinomoto
A different approach is taken by Levin et al. (1990) in which the statistical mechanical approach is coupled with the Bayesian approach. See also Schwartz et al. (1990). A generalization error is defined by the probability that a machine that has been trained with t examples misclassifies a novel example. The statistical average of the generalization error over randomly generated examples is formulated using the Bayes formula. This theory can also be viewed as a straightforward application of the predictive minimum description length method proposed by Rissanen (1986). However, it is in general difficult to calculate the generalization error, so the “annealed approximation” is suggested (Levin et al. 1990). The same problem has been treated by physicists (e.g., Hansel and Sompolinsky 1990; Sompolinsky et al. 1990; Gyiirgyi and Tishby 1990; Seung et al. 1992). They use the techniques of statistical mechanics such as the thermodynamic limit, replica method, and annealed approximation . have succeeded to evaluate the average generalization error ~ ( t ) They ) its phase transition for some in obtaining an asymptotic form of ~ ( tand specific models to which their methods are applicable. In the present paper, we also discuss the average generalization error under the annealed approximation in the Bayesian framework. It is not necessary, however, to use a statistical-mechanical framework or to assume a Gibbs-type probability distribution. Our theory is statistical and is applicable to more general models beyond the limitation of the applicability of physical methods such as the thermodynamical limit and the replica method. We obtain four types of asymptotic behaviors in the way & ( t ) decreases with t when t is large. The results are in agreement with those obtained by other methods for specific models. The asymptotic behavior does not depend on a specific structure of the target function, or on a specific architecture of the machine; they are universal in this sense. The asymptotic behavior of a learning curve depends only on whether the teacher signals are noisy or not, whether the machine is deterministic or stochastic, and whether there is a unique correct machine. The main concern of the present paper is the deterministic case. An exact analysis of stochastic cases will be given in a forthcoming paper (Amari and Murata 1992) without using the annealed approximation. 2 Main Results
The problem is stated as follows. Let us consider a dichotomy of an n-dimensional Euclidean space R“, R” = D+ U D-,D+ n D- = 4 where x E D+ is called a positive example and x E D- a negative example. A target signal y accompanies each x, where y = 1 for a positive example and y = -1 for a negative example. Given t randomly chosen x2, . . . , xt independently drawn from a probability distribuexamples XI, tion p ( x ) together with corresponding target signals y1, . . . ,yt, a learning
Four Types of Learning Curves
607
machine is required to estimate the underlying dichotomy. The machine is evaluated by its generalization error E ( t ) , that is, the probability that the next example xt+l produced by the same probability distribution is misclassified by the machine. We evaluate the average generalization error under the so-called annealed approximation and give universal theorems on the convergence rate of ~ ( tas) t tends to infinity. A machine considered here is specified by a set of continuous parameters w = ( ~ 1 , .. . ,w,) E R" and it calculates a function f(x, w). When the output of a machine is uniquely determined by the signum of f(x, w), the machine is said to be deterministic. In the deterministic case, the function f(x, w) specifies a dichotomy by
D+
I
= {x f ( X > W ) > 01
and
D- = {x If(x,w) 5 0 ) If the output is not deterministic but is given by a probability that is specified as a function of f(x, w), then it is said to be stochastic. A deterministic or stochastic neural network with modifiable synaptic weights gives a typical example of such a machine. For example, a layered feedforward neural network calculates a dichotomy function f (x,w), where w is a vector summarizing all the modifiable synaptic connection weights. The main subject of the present paper is asymptotic learning behavior of deterministic machines. The smoothness or differentiability of dichotomy functions f(x, w) is not required in the deterministic case, so that it is applicable to multilayer deterministic thereshold-element networks as well as to analog-element neural networks. We also discuss stochastic cases to compare the difference in their asymptotic behaviors. Since we use the regular statistical estimation technique, the smoothness of f(x, w) is required in the stochastic case to guarantee the existence of the Fisher information matrix. This type of network is a generalization of the fully smooth network introduced by Sompolinsky ef al. (1990) to the case where the network functions are smooth except for a threshold operation at the output. Suppose that there exists parameter wo such that the true machine calculates f(x, wg) and generates the teacher signal y based on it. In some deterministic cases, there exists a set of parameters all of which give the correct classification behavior. A typical case is that there is a neutral zone between the set of positive examples and the set of negative examples where no input signals are generated. We treat the cases both that a unique correct classifier exists and that a set of correct classifiers exists. The teacher signal y is said to be noiseless if y is given by the sign of f(x, WO) and noisy if y is stochastically produced depending on the value f(x, WO), irrespective of the machine itself being deterministic or stochastic.
S. Amari, N. Fujita, and S. Shinomoto
608
The following are the main results on the asymptotic behaviors of learning curves under the Bayesian framework and the annealed approximation. Case 1. The average generalization error behaves asymptotically as & ( t )N
m t
when a machine is deterministic, the teacher signal is noiseless, and the machine giving correct classification is uniquely specified by the mdimensional parameter WO. Case 2. The average generalization error behaves asymptotically as &(t)
C
N
t2
when a machine is deterministic, the teacher signal is noiseless, and the set of correct classifiers has finite measure in the parameter space. Case 3. The average generalization error behaves asymptotically as
when a machine is deterministic with a unique correct machine, but the teacher signal is noisy. Case 4. The average generalization error behaves asymptotically as
when a machine is stochastic. 3 The Average Generalization Error
We review here the Bayesian framework of learning along the line of Levin et al. (1990). However, it is not necessary to use the statisticalmechanical framework or to assume a Gibbs-type distribution. Let p(y I x,w)be the probability that a machine specified by w generates output y when x is input. It is given by a monotone function k ( f ) off in the stochastic case,
P(Y = 1 I x,w)= kLf(x,w)l,
0 I k ( f ) I 1,
k(O) = 1/2
In the deterministic case,
P(Y I x,w)= eIyf(X,W)l where e ( z ) = 1 when z > 0 and 0 otherwise, that is, p(y I x,w)is equal to 1 when yf(x, w)> 0 and is otherwise 0. Let q(w) be a prior distribution
Four Types of Learning Curves
609
of parameter w. Then, the joint probability density that the parameter w is chosen and t examples of input-output pairs
E(')
= [(Xl,YI),
(x2,y2),
. . . , (xt,yt)l
are generated by the machine is
n p(yi 1 t
~ ( wE"'),
= q(w)
xi, w)p(xi)
i=l
By using the Bayes formula, the posterior probability density of w is given by
where is the probability measure of ws generating (yl, . . . ,yt) when inputs (XI, . .. , xt) are chosen. In the deterministic case, the probability
is the measure of such w that are compatible with t examples $'), that is, those w satisfying yif(xi,w) > 0, for all i = 1,...,t. Therefore, the smaller this is, the easier it is to estimate the w that resolves the dichotomy. In the stochastic case, the probability Z($')) can also be used as a measure of identifiability of the true w. The quantity Zt is related to the partition function of the Gibbs distribution in the special but important case studied by Levin et al. (1990) or the physicist approach with the thermodynamical limit of the dimensionality of w tending to infinity (Seung et al. 1992). The quantity defined here is more general, although we use the same notation Zt. The generalization error ct* based on t examples J(t) is defined, in the deterministic case, as the probability that a machine that classifies t examples E@f correctly fails to correctly classify a new example xt+l. This is given by Et* =
&+I
Prob{yt+l f(xtfl, w) < 0 I yif(xi,w) > 0,
i = 1 , . . .,t } = 1 - -
zt
because Prob{yt+~f(xt+~,w) > 0 I yif(xi,w) > 0, i = 1 , - . . i t } Prob{y;f(x,,w) > 0, i = 1,.. . , t , t 1) Prob{yif(xi,w) > 0, i = 1,.. . , t }
+
S. Amari, N. Fujita, and S. Shinomoto
610
This quantity can also be considered as the generalization error in the stochastic case, because = Prob{yr+l I @'),xt+l} Zf is the probability that the machine will correctly output y,+1 given xt+1 under the condition that t examples (('1 have been observed. The generalization error .cf* is a random variable depending on the randomly generated examples The average generalization error E( t) is the average of E ~ * over all the possible examples $f) and a new pair (Yt+llXt+I)
E ( t ) = ( E ' * ) = 1 - (Zl+l/ZJ ( ) denoting the expectation with respect to [ ( t -t 1) = [ [ ( t ) ,(y,+l,xt+l)]. This quantity is closely related to the stochastic complexity E: introduced by Rissanen (1986), E: =
-(ln(l - E p ) )
=
(lnz,) - (lnZ,+1)
The actual evaluation of the quantity such as (Z,+l/Zf)and (lnZt)is generally a very hard problem and has been obtained only for a few model systems (see for example, Hansel and Sompolinsky 1990; Sompolinsky et al. 1990; Gyorgyi and Tishby 1990). We will show an exact example later. , introduce approximations, To obtain a rough estimate of E ( t ) or ~ fwe (Z,+l/Zt) N (.&+I)/(&)
and
(In&)
-
In(&)
called the "annealed average" (Levin et al. 1990), see also Schwartz et al. 1990). The approximations are valid if Zt does not depend sensitively on the most probable (xl, . . . ,xt). For this reason, we may call it the random phase approximation. The validity of the approximation is still open and we will return to this point in the final section (see also Seung etal. 1992). It is easy to show under the approximation that the average generalization error E ( t ) and the stochastic complexity E: are closely related in the asymptotic limit t -+ 00 in that &(t)
N
&;
provided E ( t ) -+ 0. [It is proved in Amari and Murata (1992) that the annealed approximation for EF gives a correct result in the stochastic case. See also Amari (1992) for the deterministic case.] Thus the remaining work is based on the evaluation of the average phase volume (Z,). 4 Case 1: A Unique Correct Deterministic Machine with a Noiseless Teacher
The expectation ( Z , ) is calculated for a deterministic machine as follows. Let s(w) be the probability that a machine specified by w classifies a
Four Types of Learning Curves
611
randomly chosen x correctly, that is, as the true classifier specified by wo does, and hence, s(w) = Probu(x, w) .f(x,wo) > 0) Since yi is the signum of f(x;,wo), (Z,)
=
i = 1,.. . , t } /q(~)Pr~b{yif(xi,w) > 0, i = 1,.. ., t I w}dw
=
/9(w){s(w)Pw
=
Prob{yif(x;,w) > 0,
where the last equality follows because yi = sgnf(xi,wo) and because f(x;,w) .f(xi,WO)> 0, for i = 1,.. . , t, are conditionally independent when w is fixed. When w is slightly deviated from the true wo in a unit direction e, I e I= 1, w = wo + r e the regions D+(w)and D-(w) are slightly deviated from the true D+(WO) and D-(wo). The classifier with w misclassifies those examples that belong to AD, which is the difference between D+(w) and D-(wo). Therefore, we have s(w) = 1 - J p(x)dx AD
We assume that the directional derivative a(e) = lim r-0
1Y /
AD
p(x) dx
exists and is strictly positive for any direction e. This holds when the probability of x belonging to AD caused by a small change Aw in w is in proportion to 1 Awl, irrespectively of the differentiability of f(x,w). Note that s(w) is not usually differentiable at w = woin the deterministic case. We use a method similar to the saddle point approximation to calculate (Zt), namely,
=
/exp{t[logs(w)
1 + :logq(w)]}dw
and by expanding logs(w) = -a(e)r
+ o(?)
and neglecting smaller order terms when 9(w) is regular, then for large t, (Z,)
= /exp{-ta(e)r}dw
Since the volume element dw can be written dw = rm-' dr dR
S. Amari, N. Fujita, and S. Shinomoto
612
where do is an angular volume element, then (Zt)
=
/exp{-ta(e)r}F'drdO
where
is a constant. From this, we have
proving the following theorem.
Theorem 1. Given the annealed approximation, a noiseless teacher and that wo is unique, the average generalization error of a deterministic machine decreases according to the universal formula m & ( t )= t
where m is the dimension of w. Remark We have assumed as regularity conditions in deriving the above result the existence of nonzero directional derivative a(e) and a regular prior distribution 9 ( w ) . These conditions hold in usual situations, however, it is possible to extend our result to more general cases. When the set w of correct classifiers forms a k-dimensional submanifold, we have, (Z,) 0: t-(m-k) so that &(t)N
m-k ~
t
In the case where the probability distribution p ( x ) is extremely densely concentrated on or sparsely distributed in the neighborhood of the boundary of D+ and D-,we have the following expansion
s(w)
-
1 - u(e)ra,
The result in this case is
-
a >0
m at so that the l / t law still holds in agreement with results obtained by other methods for many models (Haussler et al. 1988; Sompolinsky et al. 1990). &(f)
-
Four Types of Learning Curves
613
5 Case 2: Deterministic Case with a Noiseless Teacher, Where a Finite Measure of Correct Classifiers Exists
In this case, s(w) = 1 for w E So, where SO is the set of correct classifiers. We assume as a regularity condition that SOis a connected region having a piecewise smooth boundary. Moreover, we assume that if w
+ re,
= w,
where w, is the value of w at position w on dSo and e, is the unit normal vector at w, then s(w) can be expanded as
The calculation of (Z,) proceeds in this case as
= =
Lo
4( w ) dw
+ / / 9(w) exp{
ta (w)r} dr dw
-
C’
PO+i
where Po is the measure of SOand
From this it follows that
c(t)=l-(Po+&)l(Po+~)=p B where B=-
C’
PO
Hence the following theorem. Theorem 2. If So has a finite measure PO > 0, the convergence rate of c(f) for a deterministic machine is as B
E(t)
N
-
t2 where B is a constant depending on Poand the function f (x, w). Note that when SO tends to a point WO, Po tends to 0. This implies that B tends to infinity, and the asymptotic behavior changes to that of Theorem 1 where phase transition takes place. Remark. The above result is obtained from the annealed approximation of ( Z t + l / Z t ) .The above error probability E ( t ) is, roughly speaking, based
S. Amari, N. Fujita, and S. Shinomoto
614
on the learning scheme where at each time one chooses a machine randomly that correctly classifies the t examples [(t). However, the behavior is exponential, E(t) exp{ - c t } if the learning scheme is to choose a machine randomly such that it correctly classifies [(f) and keep it if it correctly classifies the ( t 1)st example, but if it does not then choose another machine randomly that does correctly classify the ( t 1) examples [(,+I). This is known as the perfect generalization (Seung et al. 1992). N
+
+
6 Case 3 A Deterministic Machine with a Noisy Teacher
This section treats the case of where the true classifier is unique and is a deterministic machine with parameter wo but teacher signals include stochastic error. The following is a typical example: The correct answer is 1 when f(x,wo) > 0 and -1 when f(x,wo) < 0, but the teacher signal y is 1 with probability kV(x, wo)] and is -1 with probability 1 - kcf). A typical function k is given by
k(u)
=
1
1
+ exp{ - P u }
where 1 / p is the so-called "temperature." In this case, we cannot usually find any w consistent with t examples [('I when t is large. We use instead a statistical estimator wtfrom t examples. From the statistical point of view, the problem is to estimate an unknown parameter vector w from t independent observations (xi,yi), i = 1, . . . , t drawn from the probability distribution specified by w,
The Fisher information matrix G is defined by G(w)=E
1
1
dlogr(x,y;w) a l o g ~ ( x , y ; w ) ~ aw aw
where E is the expectation with respect to the distribution ~ ( xy;, w),d/aw denotes the gradient column vector and the superscript T denotes the transposition. When the Fisher information exists, the estimation problem is regular. We assume that the problem is regular, which requires the differentiability of f(x,w). Let w, be the maximum likelihood estimator from t examples. It is well known that the covariance matrix of the maximum likelihood estimator w, is asymptotically given by
Four Types of Learning Curves
615
where G is the Fisher information matrix. The Fisher information matrix is explicitly given by G = p2/k(l - k)-(-)Tp(x)dx af af
aw aw where k = k(f) and f = f (x, w) (see Amari 1991; Amari and Murata 1992). The expectation of the generalization error is then given by & ( t )= 1 -
where
D
=
s
(S(W,))
D
=-
4
a(e)(eG-'eT)dR
Theorem 3. If their teacher signals include errors, then the average generalization error & ( f ) is asyrnpfoticaZZygiven by
D
& ( t )N -
Jt
This convergence rate coincides with one obtained by Hansel and Sompolinsky (1988). Here, the error probability is evaluated for a deterministic machine. When the temperature tends to 0, the teacher becomes noiseless. It should be noted that the Fisher information G tends to infinity in proportion to p2 and hence, D tends to 0 in this limit. The asymptotic behavior then changes to that of Theorem 1, phase transition taking place.
,&'
7 Case 4 Stochastic Machine
In the case of a stochastic machine, the teacher signals are also stochastic. The error probability ~ ( tnever ) tends to 0 in this case, but instead converges to some co > 0. We have
= /exp{flogs(w)} dw
where s(w) = /P(Y
I x, W)P(Y I x, WO)P(X) dxdy
Since s(w) is smooth in this case, we have the following expansion at its maximum wb, s(w) c - (w - wA)K(w - WA)T with a constant c and a positive definite matrix K. Hence, N
(Z,)
-
ctt-"'2
S. Amari, N. Fujita, and S. Shinomoto
616
so that in agreement with Sompolinsky et al. (1990) and others.
Theorem 4. For a stochastic machine, the generalization error behaves as E(t)
-
Eo
+ a-t
8 Discussions
We have thus obtained four typical asymptotic laws of the generalization error E ( f ) under the annealed approximation. However, the validity of the annealed approximation is questionable, as is discussed in Seung et al. (1992). Gyorgyi and Tishby (1990) give a different result for a simple perceptron model based on the replica method, whose validity is not guaranteed. In order to see the validity of the approximation, we calculate the exact E ( f ) for the following simple example: Consider predicting a half space of R2, where signals x = (xI,x2) are normally distributed with mean 0 and the identity covariance matrix, w is a scalar having a uniform prior 9(w), and
f (x, w ) = x1 cos w
+ x2 sin w
In this special case, the probability density function of Zt is given by
p ( ~ ,=) 4t2ztexp{-2tzf) By calculating Zf+l and averaging it over x t + l , the random variable ef is denoted as 1 e; = - ( u 2 + v’)/(u + v ) 2 where u and u are independent random variables subject to the same density function p(u) = t exp{ -tu} From this, we have the asymptotically exact result 2 3t
& ( t )= while the annealed approximation gives & ( t ) l/t. On the other hand, we have (logZ,) = c - log t N
so that
1 t where the annealed approximation holds. &f
=-
Four Types of Learning Curves
617
This shows that the approximation gives the same order of t-’ but a different factor. It is interesting to see how the difference depends on the number m of parameters in w. Looking from the point of view of statistical inference, the deterministic case and stochastic case are quite different. The estimator wtfrom t example is usually subject to a normal distribution with a covariance matrix of order l / t in the stochastic case. However, in the deterministic case, wtis usually not subject to a normal distribution. The squared error usually shows a stronger convergence. This is because the manifold of probability distributions has a Riemannian structure in the stochastic case (Amari 19851, while it has a Finslerian structure in the deterministic case (Amari 1987). This suggests a difference of the validity of the annealed approximation in the two cases. We will discuss this point in more detail in a forthcoming paper (Amari and Murata 1992; Amari 1992).
Acknowledgments The authors would like to thank Dr. Kevin Judd for valuable comments on the manuscript. This work was supported in part by Grant-in-Aid for Scientific Research on Priority Areas on “Higher-Order Brain Functions,” the Ministry of Education, Science and Culture of Japan.
References Amari, S. 1967. Theory of adaptive pattern classifiers. IEEE Trans. EC-16(3), 299-307. Amari, S. 1985. Differential-Geometrical Methods in Statistics. Springer Lecture Notes in Statistics, 28, Springer, New York. Amari, S. 1987. Dual connections on the Hilbert bundles of statistical models. In Geometrization ofStatistica1 Theory, C. T. J. Dodson, ed., pp. 123-152. ULDM, Lancaster, UK. Amari, S. 1990. Mathematical foundations of neurocomputing. Proc. IEEE 78, 1443-1463. Amari, S. 1991. Dualistic geometry of the manifold of higher-order neurons. Neural Networks 4, 443-451. Amari, S. 1992. A universal theorem on learning curves. To appear. Amari, S., and Murata, N. 1992. Predictive entropies and learning curves. To appear. Baum, E. B. 1990. The perceptron algorithm is fast for nonmalicious distributions. Neural Comp. 2, 248-260. Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
618
S. Amari, N. Fujita, and S. Shinomoto
Gyorgyi, G., and Tishby, N. 1990. Statistical theory of learning a rule. In Neural Networks and Spin Glasses, W. K. Theumann and R. Koberle, eds., pp. 3-36, World Scientific, Singapore. Haussler, D., Littlestone, N., and Warmuth, K. 1988. Predicting 0 , l functions on randomly drawn points. Proc. COLT'88, pp. 280-295. Morgan Kaufmann, San Mateo, CA. Hansel, D., and Sompolinsky,H. 1990. Learning from examples in a single-layer neural network. Europhys. Lett. 11, 687-692. Levin, E., Tishby, N., and Solla, S. A. 1990. A statistical approach to learning and generalization in layered neural networks. Proc. I E E E 78, 1568-1574. Rissanen, J. 1986. Stochastic complexity and modeling. Ann. Statist. 14, 10801100. Rosenblatt, F, 1961. Principles of Neurodynamics. Washington, D.C.: Spartan. Rumelhart, D., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, I: Foundations. MlT Press, Cambridge, MA. Schwartz, D. B., Samalam, V. K., Solla, S. A., and Denker, J. S. 1990. Exhaustive learning. Neural Comp. 2, 374-385. Seung, H. S., Sompolinsky, H., and Tishby, N. 1992. Statistical mechanics of learning from examples. To appear. Sompolinsky, H., Seung, S., and Tishby, N. 1990. Learning from examples in large neural networks. Phy. Rev. Letf. 64, 1683-1686. Valiant, L. G. 1984. A theory of the learnable. Comm. ACM 27,1134-1142. White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Comp. 1, 425-464.
Received 24 May 1991; accepted 15 November 1991
This article has been cited by: 2. Kazushi Ikeda. 2004. An Asymptotic Statistical Theory of Polynomial Kernel MethodsAn Asymptotic Statistical Theory of Polynomial Kernel Methods. Neural Computation 16:8, 1705-1719. [Abstract] [PDF] [PDF Plus] 3. Kazushi Ikeda. 2004. Geometry and learning curves of kernel methods with polynomial kernels. Systems and Computers in Japan 35:7, 41-48. [CrossRef] 4. Tatsuya Uezu. 2002. On the Conditions for the Existence of Perfect Learning and Power Law Behaviour in Learning from Stochastic Examples by Ising Perceptrons. Journal of the Physics Society Japan 71:8, 1882-1904. [CrossRef] 5. Sebastian Risau-Gusman, Mirta Gordon. 2001. Statistical mechanics of learning with soft margin classifiers. Physical Review E 64:3. . [CrossRef] 6. Sumio Watanabe . 2001. Algebraic Analysis for Nonidentifiable Learning MachinesAlgebraic Analysis for Nonidentifiable Learning Machines. Neural Computation 13:4, 899-933. [Abstract] [PDF] [PDF Plus] 7. S. Watanabe. 2001. Learning efficiency of redundant neural networks in Bayesian estimation. IEEE Transactions on Neural Networks 12:6, 1475-1486. [CrossRef] 8. Sumio Watanabe. 2000. On the generalization error by a layered statistical model with Bayesian estimation. Electronics and Communications in Japan (Part III: Fundamental Electronic Science) 83:6, 95-106. [CrossRef] 9. Hanzhong Gu, H. Takahashi. 2000. How bad may learning curves be?. IEEE Transactions on Pattern Analysis and Machine Intelligence 22:10, 1155-1167. [CrossRef] 10. R. Urbanczik. 1998. Multilayer perceptrons may learn simple rules quickly. Physical Review E 58:2, 2298-2301. [CrossRef] 11. H. Sompolinsky, J. Kim. 1998. On-line Gibbs learning. I. General theory. Physical Review E 58:2, 2335-2347. [CrossRef] 12. Siegfried Bös. 1998. Statistical mechanics approach to early stopping and weight decay. Physical Review E 58:1, 833-844. [CrossRef] 13. Siegfried Bös, Manfred Opper. 1998. Journal of Physics A: Mathematical and General 31:21, 4835-4850. [CrossRef] 14. H. Takahashi, H. Gu. 1998. A tight bound on concept learning. IEEE Transactions on Neural Networks 9:6, 1191-1202. [CrossRef] 15. Tatsuya Uezu, Yoshiyuki Kabashima. 1996. Journal of Physics A: Mathematical and General 29:17, L439-L445. [CrossRef] 16. H. Gu, H. Takahashi. 1996. Towards more practical average bounds on supervised learning. IEEE Transactions on Neural Networks 7:4, 953-968. [CrossRef]
17. Y. Hamamoto, S. Uchimura, S. Tomita. 1996. On the behavior of artificial neural network classifiers in high-dimensional spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 18:5, 571-574. [CrossRef] 18. Tatsuya Uezu, Yoshiyuki Kabashima. 1996. Journal of Physics A: Mathematical and General 29:3, L55-L60. [CrossRef] 19. N. Barkai, H. Seung, H. Sompolinsky. 1995. Local and Global Convergence of On-Line Learning. Physical Review Letters 75:7, 1415-1418. [CrossRef] 20. D. Barber , D. Saad , P. Sollich . 1995. Test Error Fluctuations in Finite Linear PerceptronsTest Error Fluctuations in Finite Linear Perceptrons. Neural Computation 7:4, 809-821. [Abstract] [PDF] [PDF Plus] 21. Yoshiyuki Kabashima , Shigeru Shinomoto . 1995. Learning a Decision Boundary from Stochastic Examples: Incremental Algorithms with and without QueriesLearning a Decision Boundary from Stochastic Examples: Incremental Algorithms with and without Queries. Neural Computation 7:1, 158-172. [Abstract] [PDF] [PDF Plus] 22. Michael Kearns, H. Sebastian Seung. 1995. Learning from a population of hypotheses. Machine Learning 18:2-3, 255-276. [CrossRef] 23. Y Kabashima. 1994. Journal of Physics A: Mathematical and General 27:6, 1917-1927. [CrossRef] 24. Kukjin Kang, Jong-Hoon Oh, Chulan Kwon, Youngah Park. 1993. Generalization in a two-layer neural network. Physical Review E 48:6, 4805-4809. [CrossRef] 25. Shun-ichi Amari , Noboru Murata . 1993. Statistical Theory of Learning Curves under Entropic Loss CriterionStatistical Theory of Learning Curves under Entropic Loss Criterion. Neural Computation 5:1, 140-153. [Abstract] [PDF] [PDF Plus] 26. Y. Kabashima , S. Shinomoto . 1992. Learning Curves for Error Minimum and Maximum Likelihood AlgorithmsLearning Curves for Error Minimum and Maximum Likelihood Algorithms. Neural Computation 4:5, 712-719. [Abstract] [PDF] [PDF Plus]
619
Errata
In "How Tight Are the Vapnik-Chernvonenkis Bounds?" by David Cohn and Gerald Tesauro [Neural Computation 4(2), 249-2691, Figure 1 on page 255 and Figure 5 on page 260 should be exchanged. Also, on page 262, last paragraph, the sentence: If such an algorithm classifies its m randomly drawn training examples correctly, then it will, with high confidence, have a generalization error of at most t 5 d/m. should be amended: If such an algorithm classifies its m randomly drawn training examples correctly, then its expected generalization error will be at most E 5 d/m.
ARTICLE
Communicated by Larry Abbott
Nonlinear Dynamics and Symbolic Dynamics of Neural Networks John E. Lewis* Leon Glass Department of Physiology, McGill University, 3655 Drummond Street, Montrkal, Quibec, Canada H3G 1Y6
A piecewise linear equation is proposed as a method of analysis of mathematical models of neural networks. A symbolic representation of the dynamics in this equation is given as a directed graph on an N-dimensional hypercube. This provides a formal link with discrete neural networks such as the original Hopfield models. Analytic criteria are given to establish steady states and limit cycle oscillations independent of network dimension. Model networks that display multiple stable limit cycles and chaotic dynamics are discussed. The results show that such equations are a useful and efficient method of investigating the behavior of neural networks. 1 Introduction
An understanding of the dynamics of neural networks is essential to the study of many animal behaviors, from such primitive functions as respiration and locomotion to the most sophisticated such as perception and thought. In the past several decades, there have been extensive theoretical analyses complementing purely experimental approaches. In this paper we discuss the properties of theoretical models of neural networks from a perspective of nonlinear dynamics. We analyze qualitative features of the dynamics such as the existence and stability of steady states, cycles, and chaotic dynamics. Theoretical models of neural networks (Hopfield 1984) in the infinite gain limit can be written as a piecewise linear ordinary differential equation that was studied some years ago (Glass 1975, 1977a,b; Glass and Pasternack 1978). Since a good deal is known about the properties of the piecewise linear equation, this can be immediately translated to the study of neural network models. In Section 2 we motivate and illustrate the results by analyzing a didactic example of a 2 neuron network. In this section we also show how this simple example generalizes to *Present address: Department of Biology, University of California, San Diego, La Jolla, CA, 92093-0322.
Neural Computation 4,621-642 (1992)
@ 1992 Massachusetts Institute of Technology
John E. Lewis and Leon Glass
622
an N-dimensional piecewise linear ordinary differential equation that is equivalent to more familiar theoretical models of neural networks. In Section 3 we discuss the properties of the piecewise linear equation and obtain graphic criteria for stable steady states and limit cycle oscillations. In Section 4 we consider dynamics in several specific networks. We illustrate many different types of dynamics found in these networks with the emphasis on exotic dynamics such as multiple attractors, complex bifurcations, and chaotic dynamics. A preliminary report of some of these results has recently appeared (Lewis and Glass 1991). 2 Theoretical Models of Neural Networks -
2.1 A Network with Feedback Inhibition. This section contains a pedagogic example to illustrate the basic ideas of our approach. Consider a network consisting of 2 model neurons whose activities are represented by y1 and y2. We assume that yl excites y2, but that y2 inhibits y1. This network is modeled by the ordinary differential equation
dy2 dt = -y2
-
where H(y) is the Heaviside step function
The equations are piecewise linear and can be integrated analytically. For example, consider a point [yl(0), y*(O)] in the positive quadrant. Integrating equation 2.1, we obtain
yl(t) = -1
+ [yl(0) +1]exp(-t),
y2(t) = 1+ [y2(0)- 11exp(-t) (2.3)
From equation 2.3 we find that the trajectories in the positive quadrant are straight lines given by
In similar fashion, the trajectories in the other quadrants follow from a direct integration of the equations. In any given quadrant the flow is focused towards a point in the adjacent quadrant in a counterclockwise direction (Fig. 1A). All the focal points lie on one of the vertices of a square centered at the origin. The limiting behavior as t + 00 is determined as follows. Consider an initial point (s, 0) lying on the positive yl axis. After passing through all four quadrants the point will be mapped to [ k ( s )0,1 where k ( s ) is called the Poincark or return map and is given S
k(s)= 1 +4s
(2.5)
Nonlinear and Symbolic Dynamics of Neural Networks
623
Figure 1: (A) Phase plane portrait of the neural network in equation 2.1. All trajectories are straight lines directed to the focal points indicated by the heavy dots. (B) Coarse grained phase space associating a Boolean state to each of the four quadrants. (C) A directed graph showing the symbolic transitions allowed in this network. By iterating this map we find that the subsequent images of the initial point approach the origin (Glass and Pasternack 1978). Thus, the flow spirals in toward the origin. This discussion provides a complete analysis of this problem from the perspective of nonlinear dynamics. Symbolic dynamics provides a complementary method of capturing qualitative features of the flow. In symbolic dynamics one divides the phase space up into coarse regions and gives each a symbol. Instead of the trajectory that gives the values of the variables as a function of time,
John E. Lewis and Leon Glass
624
the dynamics is given by a sequence of symbols reflecting the coarse grained regions through which the flow passes. In the current case a natural coarse graining is to label each of the four quadrants by a Boolean state as shown in Figure 1B. The flow between the four states is now reflected as a directed graph (Fig. 1C). Thus, in symbolic dynamics the flow is represented as 10 --t 11
---f
01 + 00 + 10 + ' .
The analysis that follows shows several ways in which symbolic dynamics can be used in the analysis of neural networks. We show that (1) restrictions on symbolic transitions can be determined without a detailed numerical or analytical integration of the dynamics but based solely on the logical structure of the network; (2) in some cases the properties of the differential equations can be derived from the symbolic transitions; and (3) symbolic dynamics offers novel ways to classify dynamics. 2.2 N-Dimensional Equations. We now consider vector fields in N dimensions that represent a natural extension of the example in Section 2.1. In N dimensions, Euclidean phase space is subdivided into 2N regions, called orthants. All the orthants share a common point at the origin. In each orthant the trajectories are straight lines directed from each point of the orthant to a focal point. All the trajectories in each orthant are directed toward the same focal point, but the focal points may be different for the different orthants. The focal points are chosen such that the flows across the boundary between any two adjacent orthants are transverse and are of unique orientation. Piecewise linear equations, originally proposed by Glass and Pasternack (1978), represent the class of vector fields just described. There are N variables, designated y;, i = 1 , 2 , . . . ,n. For each variable yi, we define a corresponding Boolean variable, yi, where
(2.6)
The equations can be written in terms of the Boolean variables to give
dy, = A l ( i j l , .. . ,y,-l,ij,+l,. . . , y ~-) y,, dt
i == 1 , 2 , . . . ,N
(2.7)
where for each i the value of A;(y,, . . . , y i - l , y , + l , . . . , y ~does ) not depend on ji, and Ar is nowhere 0. Now we consider neural networks. One popular formulation (Hopfield 1984; Sompolinsky et al. 1988; Amit 1989) of neural networks is N
-yi
+ C wjjGj(yj)- r;, j=1
i = 1 , 2, . . . , N
(2.8)
Nonlinear and Symbolic Dynamics of Neural Networks
625
where N is the number of elements constituting the network, Gi is a nonlinear gain function describing the response of each element to an input, ri is a parameter that we interpret as the response threshold, wij gives the weight of the input of element j to element i, and wii = 0. It is usual to assume that the nonlinear functions Gj are monotonically increasing or decreasing sigmoidal functions. Consider the limit of infinite slope (or gain) of the sigmoidal function in which the functions G, are piecewise constant with a single discontinuity at 0, so that
(2.10)
Consequently, equations 2.7 and 2.8 are equivalent provided the values of A, are N
A i ( y 1 , . . . ,Yip,,
Y ~ + I , .. . , YN) = x w i j G / ( y l ) - ~
i ,
i = 1,2,. . . ,N (2.11)
]=I
This analysis shows that commonly used neural network models in the infinite gain limit are a special case of the piecewise linear equations proposed by Glass and Pasternack (1978). 3 Symbolic Dynamics and the State Transition Diagram
Some of the qualitative features of the dynamics of equation 2.7 can be appreciated from a symbolic representation of the dynamics on an Ndimensional hypercube, called an N-cube. We now describe some of the properties of N-cubes and then show their connection with the piecewise linear differential equations. Readers may find it useful to refer back to the example discussed in Section 2.1 to see how the concepts apply in a simple case. Several additional examples are given in Section 4. 3.1 The N-Cube. Boolean N-cubes have often been used to represent dynamics in switching networks (Keister et al. 1951). A Boolean variable is either 1 or 0. If there are N variables, then a Boolean state is an N-tuple of Is and 0s designating a value for each variable. For N variables there are 2N different Boolean states. For equation 2.7, the N-dimensional Euclidean phase space can be partitioned into 2N orthants, by the coordinate hyperplanes defined by y; = 0. Each orthant can be labeled by an N-tuple of 1s and Os, corresponding to the values of y i from equation 2.6. The N-cube can now be constructed by selecting a single point from each of the 2N orthants.
John E. Lewis and Leon Glass
626
Each of these points, called vertices, is labeled by the Boolean N-tuple designating the orthant from which it was derived. Each vertex can be connected to N adjacent vertices associated with Boolean states that differ in 1 locus. The resulting geometric object, called the N-cube, has 2N vertices and N x 2N-' edges. The (Hamming) distance between any 2 Boolean states, or vertices on the N-cube, is equal to the number of loci that differ in the 2 states. 3.2 Integration of the Piecewise Linear Equations. From the above discussion every point in phase space is mapped to a vertex of the Ncube. The solution curves of equation 2.7 originating at a point P = (PI ~ Z . I. . IPN)are given by 3
y;
=
A;
+ (pi
-
A;) exp(-t),
i = 1,2,. . . ,N
(3.1)
where
A
= Mpl, p2r . . . I p N )
(3.2)
Thus, all the local solutions to equation 2.7 in the orthant containing P are straight lines directed to a common focal point (A1,Az, . . . , AN). Each orthant in phase space has an associated focal point, so that the flows are piecewise linear and piecewise focused. Solving the equation is reduced to connecting the analytical solution curves in equation 3.1 in a piecewise fashion for each element. This entails finding the sequence of times at which the solution trajectory crosses one of the threshold hyperplanes, y; = 0. Given an initial condition P = ( p l , p Z , . . . ,PN) at a time t, the times, t; (i = 1,.. . ,N), at which each of the N variables would cross a threshold hyperplane are given
(3.3) Taking the minimum of t; (over all i) gives the next transition time. To carry out a numerical integration of the system, we compute the next transition time, then update the variables, and iterate the process using equation 3.3 with the new definitions of A,. 3.3 The Truth Table and the State Transition Diagram. Based on the above discussion, we have the coarse grained symbolic transition
- -
Plrp2,. . . ,pN
A1, A 2 , .
..>AN
where the first state represents the orthant of th.e initial point P and the second state represents the orthant of the focal point toward which the flow is directed. The table that gives the symbolic location of the focal point for each orthant is defined here as the truth table. Now consider the connection between the flows in the piecewise linear equations, and the truth table. Call the current Boolean state S1 and
Nonlinear and Symbolic Dynamics of Neural Networks
627
the Boolean state toward which the flow is directed, given by the truth table, S2. If the distance between S1 and S2 is 0, then all initial conditions in orthant S1 are directed towards the focal point in S1 leading to a stable steady state in the differential equation. If the distance between S1 and S2 is 1 then trajectories from all initial conditions in S1 are directed across the common boundary between S1 and S2. Now suppose the distance between S1 and S2 is greater than 1; for example, let the two states differ in n loci. Then the flow from S1 can be directed to any of the n different orthants that lie a distance of 1 from S1 and n - 1 from S2. The boundary that is crossed depends on the initial condition in S1. As a consequence of the above properties the allowed transitions can be represented as a directed graph on an N-cube. This directed graph is called the state transition diagram. As the dynamics of equation 2.7 evolve, the trajectories may pass into different orthants in phase space. Thus a symbolic sequence is generated corresponding to the sequence of orthants visited along the trajectory. These symbolic sequences are consistent with the allowed transitions from the state transition diagram on the N-cube. The state transition diagram for equation 2.7 has the following property. Each edge is oriented in oneand only one direction. This can be established using simple arguments (Glass 1975, 1977a,b). Since we assume that for - each i the value of Ai(y1,. . . ,y,-1, y,+l,. . . ,y ~ does ) not depend on yi ke., wii = O), an edge cannot be directed in two directions. From the construction of the state transition diagram, the number of directed edges in the state transition diagram is equal to the distance between each state on the left-hand side of the truth table, and the subsequent state on the right-hand side. Each column on the right-hand side of the truth table contributes ZN-' to the total distance, and there are N columns so that the total distance is N x 2N-1. This is equal to the total number of edges of the N-cube. Since no edge can be oriented in 2 directions, it follows that every edge has one unique orientation. 3.4 Steady States and Limit Cycles. A problem of general interest is to make assertions concerning the qualitative dynamics of equation 2.7 based solely on the state transition diagram. Previous work established rules to find stable steady states and limit cycles (Glass and Pasternack 1978). Very briefly, if the N edges at any given vertex of the N-cube are all directed toward it, then in the corresponding orthant of phase space there will be a stable steady state. These steady states, which are called extremal steady states, have been the main focus in the study of neural networks (Cowan and Sharp 1988). For an oscillation to result, a necessary condition is that there be a cyclic path in the state transition diagram. This is not, however, a sufficient condition to guarantee stability or uniqueness of the oscillation. In some circumstances, a much more powerful result can be found. A cyclic attractor is defined as a configuration on the N-cube that is analogous to a stable limit cycle in a differential equation. A cyclic attractor of length n is a cyclic path through n vertices
628
John E. Lewis and Leon Glass
of the N-cube such that (1) the edge between successive vertices on the cycle is directed from one to the next in sequence; (2) for any vertex on the cycle, there are N - 2 adjacent vertices that are not on the cycle, and the edge(s) from each of these adjacent vertices idare) directed toward the cycle. If there is a cyclic attractor in the state transition diagram then in the associated piecewise linear differential equations there is either a stable unique limit cycle in phase space such that all points in all orthants associated with the cyclic attractor approach the limit cycle in the limit t 4 00, or there is an asymptotic oscillatory approach to a point Pf.The point Pf is analogous to a stable focus with each of the n coordinates involved in the cyclic attractor approaching zero. The proof of this result relies on the explicit algebraic computation of the limiting properties of the Poincare map, giving the return to a threshold hyperplane. The Poincare map is (3.4)
where z is an (N - 1) vector on a threshold hyperplane, A is an (N - 1) x (N - 1) positive matrix, 4 is a nonnegative (N- 1)vector, and the brackets represent the inner product. For this system, the limiting properties of equation 3.4 on iteration follow using the Perron theorem (Glass and Pasternack 1978). 3.5 Chaotic Dynamics. Chaotic dynamics are aperiodic dynamics in a deterministic system in which there is a sensitivity to the initial state of the system so that two initial conditions, arbitrarily close to one another diverge exponentially over time (Ruelle 1989). Since the flow in any given orthant is always focused toward a single point, it is not obvious that equation 2.7 can display chaotic dynamics. However, as we will show in Section 4 [see also Lewis and Glass (1991)1, numerical integration shows chaotic dynamics in some systems. We have not yet found criteria for chaotic dynamics based on the state transition diagram on the N-cube.
4 Dynamics in Model Networks
In this section we illustrate the dynamics that we have observed so far in equation 2.7. Since we are interested in neural networks, we assume the special case given by equations 2.8 and 2.9, and we assume unless otherwise stated that for all j , the functions Gj(yj) are the same with uj = 1 and b, = 0, and 7;= 7 for all i. Likewise all terms of the connection matrix, wll, are either 1 or 0. Each of the N elements in the network has the same number of inputs, np.
Example 1: Steady States. Consider the network in Figure ZA, where the symbol y2 -I y1 implies y2 inhibits yl (w12:= 1) and T = 0.5. The integration of the dynamics starting from several initial conditions is
Nonlinear and Symbolic Dynamics of Neural Networks
629
A
B
yz ( -0.5,0.f
I
'
'
I
(0.5, -0.5)
C 0 l r [ ;
1 0 00
1 1
1 0 0 0
Figure 2: (A) Schematic diagram of a neural network in which there is mutual inhibition. (B) Integration of the PL equations in the phase plane, r = 0.5. The heavy dots indicate the focal points. (C) State transition diagram on the 2-cube (yly2) and the associated truth table. shown in Figure 2B, and the N-cube state transition diagram and truth table are shown in Figure 2C. There are two stable steady states.
Example 2: Stable Limit Cycle. A second example is the cyclic inhibitory loop shown in Figure 3A with N = 3. For T = 0.5, this system gives a unique stable limit cycle oscillation, associated with the cyclic attractor in the state transition diagram (Fig. 3B) (Glass 1975, 1977a,b; Glass and Pasternack 1978). Classification of stable limit cycles using the result in Section 3.4 has been considered previously. The number of distinct cyclic attractors under the symmetry of the N-cube is 1, 1, 3, 18 in dimensions 2, 3, 4, 5, respectively (Glass 1977a). Example 3: Multiple Limit Cycles in a 5-DNetwork. Now consider the dynamics of the 5-element network shown in Figure 4A (n, = 2) with r E (1,2). The state transition diagram for this network is shown in
John E. Lewis and Leon Glass
630
A
B
0 1 0 0 1 1
0 0 1
1 0 1
1 0 0
Figure 3: (A) Schematic diagram of a neural network composed of 3 elements. (B) State transition diagram on the 3-cube ( Y i i j 2 i j 3 ) and the associated truth table. There is a cyclic attractor passing through the states 001, 101, 100, 110, 010, 011.
Figure 4B. Let each vertex on one 4-cube represent all the vertices of the 5-cube in which the first digit of the 5-tuple is 0 and each vertex on the other 4-cube represent all the vertices of the 5-cube in which the first digit is 1. Each vertex on one 4-cube is connected to the equivalent vertex on the other. From numerical integration, there are 8 stable cycles that have different symbolic sequences for the range of T considered. The sequences of states for each of these cycles are shown in Table 1, and can also be followed on the state transition diagram. Each state is represented by the 5-tuple y l y 2 y 3 y 4 y 5 .
Figure 4: Facing page. (A) The 5-element network described in Example 3. All connections are inhibitory and of uniform magnitude (i.e., wij = 1). (B) The state transition diagram for the network in (A). The upper 4-cube represents all states in which the first locus is 1; the lower 4-cube represents all states in which the first locus is 0. See text for a more detailed description.
631
Nonlinear and Symbolic Dynamics of Neural Networks
A
10110
11110
I
1xxxx
11110
oxxxx
John E. Lewis and Leon Glass
632
Table 1: Limit Cycles in Example 3. Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
Cycle 8
10010 00010 00011 00001 00101 00100 01100 01000 01001 00001 10001 10000
11010 01010 01011 00011 00111 00101 00100 01100 01000 01001 11001 11000
10010 00010 00011 00111 00101 00100 01100 01000
01001 11001 10001 10000
10010 00010 01010 01011 00011 00111 00101 00100 01100 01000 01001 11001 11000 11010
10010 00010
00011 00111 00110 01110 01100 01000 01010 01011 01001 11001
10001 10000
10010 00010 01010 OlOll
00011 00001 00101 00100 01100 01000 01001 11001 11000 10000
10010 00010 00011 00111 00110 00100 01100 01000 01010 01011 01001 00001 10001 10000
10010 00010 00011 00111 00110 01110 01010 01011 01001 11001 10001 10000
The stability of each of these cycles depends on the value of 7 . For example, Figure 5 shows the three different stable cycles for T = 1.9. From left to right the panels show the time series corresponding to cycle 4, 5, and 6 from Table 1. To illustrate the bifurcations, we consider the returns to a 4-dimensional face 3 3 separating two neighboring orthants in phase space. The state transition diagram can be used to choose &. In this example, there is not one state transition that is common to all 8 cycles. However, the transition 01100 + 01000 is common to all cycles except cycle 7. By plotting the point of intersection of the trajectory with this hyperplane as the value of T is varied for different initial conditions, the regions of parameter space for which each of the 8 cycles are stable can be observed. Projections of the bifurcation diagram constructed in this way onto the yi-axes are shown in Figure 6. In such diagrams, more than one branch for a given value of 7 indicates that either there are multiple cycles, or that one or more cycles have multiple crossings of F3. In Figure 6, each different branch represents a unique cycle. We have numerically analyzed the bifurcations shown here. Briefly, the bifurcation occurring near T = 1.29 appears to be a subcritical Hopf bifurcation. Increasing T above this value causes cycles 2 and 3 to lose stability (upper and lower branches). Cycle 1 maintains its stability through this point (middle branch). Near T = 1.66, an exchange of stability from cycle 1 to cycle 4 occurs. Cycles 5 and 6 gain stability near T = 1.79 in a bifurcation that is similar to that occurring with cycles 2 and 3 for 7 = 1.29. Cycles 7 and 8 are stable for values of T E (1,1.25). Cycles 5 and 6 are identical under a relabeling transformation. To make this more clear, consider the sequences of the state transitions in
Nonlinear and Symbolic Dynamics of Neural Networks
cycle 5
cycle 4
0.0 -
yz -0.4 -0.8-
633
cycle 6
mmm
0.0-
y4 -0.4-0.8 -
0.0
y5 -0.4 -0.8
1 0
10
20
time
Figure 5: Multistability of cycles for the network described in Example 3 (Fig. 4A). Three different cycles are stable for T = 1.9 and are shown here by choosing three different initial conditions. The time axis is arbitrary.
Cycles 5 and 6 are identical under a relabeling transformation. To make this clearer, consider the sequences of state transitions in Table 1 corresponding to the two cycles. As mentioned earlier, each state is represented by the 5-tuple y1y2y3y4y5. The relabeling transformation is the following: switch locus 1 with locus 3 and locus 2 with locus 4; in other words, the 5-tuple y1y2y3y4y5 becomes y3y4y1y2y5. Performing this transformation on one of the cycles shows that the sequences of state transitions are the same, and thus the cycles are the same. This symmetry is also evident in the connectivity of the network (Fig. 4A). A similar relationship exists between cycles 2 and 3 and between cycles 7 and 8.
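The relabeling argument is easy to verify mechanically. The sketch below applies the locus permutation (1 with 3, 2 with 4) to every state of one cycle and checks whether the result equals the other cycle up to a cyclic shift of the starting state; the two sequences to compare should be taken from Table 1.

    def relabel(state):
        # y1 y2 y3 y4 y5  ->  y3 y4 y1 y2 y5
        return state[2] + state[3] + state[0] + state[1] + state[4]

    def same_cycle(a, b):
        # equal as cyclic sequences, allowing any starting state
        if len(a) != len(b) or a[0] not in b:
            return False
        k = b.index(a[0])
        return a == b[k:] + b[:k]

    # usage: with cycle5 and cycle6 as lists of 5-character strings taken
    # from Table 1, same_cycle([relabel(s) for s in cycle5], cycle6)
    # should return True.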
Example 4: Chaotic Dynamics in a 6-D Network. The 6-element network (n_p = 3) in Figure 7 exhibits chaotic dynamics for some parameters. A projection of the attractor onto the y2-y4 plane is shown in Figure 8A. We consider a face, F4, separating the orthants defined by 011011 and 010011. Figure 8B shows the density histogram for the times between 2000 successive returns to F4, and Figure 8C shows the density for a single variable, y4, on each return to F4.
Figure 6: Bifurcation diagram for returns to the face F3, for values of τ from 1.001 to 1.999 in steps of 0.001. Each panel (A-D) shows the projections onto the different axes.

We also consider the evolution of the density histograms for successive returns to F4 for a set of 2000 initial conditions in which y4 was varied and the other variables were held constant. Figure 8D-F shows that by the 20th return, the histograms have reached a density that is the same as that of a single trajectory (Fig. 8C). The approach to an invariant density, and the observation of the same invariant density along a single trajectory, constitute numerical evidence that this system is ergodic and has a unique invariant density, two features common to many chaotic systems (Lasota and Mackey 1985). Now we consider the effects of varying τ on the dynamics of this network. The dynamics are tracked by plotting the values of y4 on 30 successive crossings of F4 as τ is varied. Figure 9A shows the resulting bifurcation diagram. As τ is increased from τ = 1.2, the dynamics change from a simple limit cycle to aperiodic behavior.
Figure 7: The 6-element network discussed in Example 4.

For larger values of τ, a limit cycle is evident again. In the aperiodic region, there are at least 4 periodic windows, spaced nearly symmetrically about τ = 1.5. This simple example shows how τ can influence the network dynamics. Since the step function nonlinearity in equation 2.9 is not realistic as a model for most biological processes, it is important to clarify the dynamics when continuous nonlinear functions are used in equation 2.8. We consider a continuous gain function (equation 4.1), parameterized by a positive constant β, that approaches a step function in the limit β → ∞. A 4th-order Runge-Kutta integration scheme (Δt = 0.01) was used to solve the equations. As the value of β increases, the continuous system exhibits a complex sequence of bifurcations. By using a method similar to that described for Example 3, a bifurcation diagram was constructed for values of β between 7.0 and 12.0 (Fig. 9B). The value of y4 is plotted as the solution trajectory crosses the y3 = 0 hyperplane in a negative sense. For each value of β, a transient of 300 crossings was allowed before the next 30 points were plotted. A different example of a chaotic 6-dimensional network also shows a complex sequence of bifurcations as a continuous sigmoidal function is steepened (Lewis 1991; Lewis and Glass 1991). Further study of the bifurcations in these systems is needed.
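A sketch of the continuous case follows. The gain function used is an assumption of this sketch: a logistic g(s) = 1/(1 + exp(-βs)) serves as a generic sigmoid that steepens to a step function as β → ∞ (the paper's equation 4.1 gives the exact form), and the weighted-input term stands in for equation 2.8. The integration step Δt = 0.01 and the crossing bookkeeping match the text.

    import numpy as np

    def g(s, beta):
        # assumed logistic gain; approaches a step function as beta -> infinity
        return 1.0 / (1.0 + np.exp(-beta * s))

    def field(y, W, tau, beta):
        # stand-in for equation 2.8 with a continuous nonlinearity: each
        # element decays toward a gain-transformed weighted input in [-tau, tau]
        return -y + tau * (2.0 * g(W @ y, beta) - 1.0)

    def rk4_step(y, dt, W, tau, beta):
        k1 = field(y, W, tau, beta)
        k2 = field(y + 0.5 * dt * k1, W, tau, beta)
        k3 = field(y + 0.5 * dt * k2, W, tau, beta)
        k4 = field(y + dt * k3, W, tau, beta)
        return y + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

    def crossings(y0, W, tau, beta, dt=0.01, n_steps=200000):
        # record y4 at negative-going crossings of the y3 = 0 hyperplane,
        # discarding a transient of 300 crossings and keeping the next 30
        y, out = y0.copy(), []
        for _ in range(n_steps):
            y_new = rk4_step(y, dt, W, tau, beta)
            if y[2] > 0 >= y_new[2]:
                out.append(y_new[3])
            y = y_new
        return out[300:330]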
Figure 8: (A) Projection of the dynamics onto the y2-y4 plane for τ = 1.5. (B) The density histogram for the times between successive crossings of F4. (C) The density histogram of y4 for 2000 successive crossings of F4 on a single trajectory. (D-F) The density histograms of y4 for the 1st, 3rd, and 20th returns to F4, using 2000 different initial conditions in which equally spaced values of y4 were chosen between -0.2 and 0, with y1 = -0.293862, y2 = 0.478693, y3 = 0.0, y5 = 0.028766, and y6 = 0.270764.

Example 5: Chaotic Dynamics in a Network of 50 Elements. We now consider the dynamics of a larger network consisting of 50 elements with n_p = 5 and τ = 2.5. Details concerning the network are in Lewis (1991) and will be provided on request. In this network, a search of 100 randomly chosen initial conditions revealed no steady states or limit cycles. As in the previous examples, the value of a single variable on the return of the trajectory to an (N - 1)-dimensional face, F5, is considered. Figure 10A shows the density histograms of y1 on F5 (left panel) and the times between returns (right panel) for 500 successive returns of a single trajectory. Figure 10B shows the density histograms for y1 and the return times for a first return map constructed by taking initial conditions on F5 where all initial values were constant except y1, which was varied from -3.0 to -1.0 (as in Example 4).
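The invariant-density comparison used in Examples 4 and 5 can be sketched as follows. The return sequences are assumed to come from a crossing detector like the one sketched earlier; the function simply compares the density along one long trajectory with the density of the k-th return over an ensemble of initial conditions.

    import numpy as np

    def compare_densities(single_traj_returns, ensemble_kth_returns, bins=50):
        """Compare the return-map density of one trajectory with that of the
        k-th return over an ensemble of initial conditions (cf. Figs. 8, 10).
        A small output value indicates similar densities."""
        lo = min(np.min(single_traj_returns), np.min(ensemble_kth_returns))
        hi = max(np.max(single_traj_returns), np.max(ensemble_kth_returns))
        h1, edges = np.histogram(single_traj_returns, bins=bins,
                                 range=(lo, hi), density=True)
        h2, _ = np.histogram(ensemble_kth_returns, bins=bins,
                             range=(lo, hi), density=True)
        return np.abs(h1 - h2).mean(), edges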
Figure 9: (A) Bifurcation diagram showing the value of y4 on 30 successive crossings of F4, after a sufficient transient, for different values of τ. (B) Bifurcation diagram as a function of β for the continuous network described in Example 4. After a transient, the values of y4 are plotted for 30 consecutive crossings of the y3 = 0 hyperplane in a negative sense.

These density histograms are similar to those of a single trajectory (Fig. 10A) after only one return to F5. Calculating a first return map for a smaller interval of y1, between -2.1 and -1.9, again reveals similar density histograms (Fig. 10C). This system is chaotic, and only a small number of passes through phase space is required for nearby trajectories to diverge.
Figure 10: (A) Left panel: The density histogram of y1 on F5 for 500 successive crossings of a single trajectory. Right panel: The density histogram for the corresponding times between successive crossings of F5. (B) Left panel: The density histogram of y1 on the first return map constructed for 500 different initial conditions on F5 in which the value of y1 was varied between -3.0 and -1.0. Right panel: The density histogram of the corresponding crossing times for the data in the left panel. (C) Same as (B) but using initial values of y1 between -2.1 and -1.9.

5 Discussion

Neural networks in nature display a wide range of complex dynamic behavior, ranging from more or less regular periodic behavior to complex fluctuations that are difficult to characterize. The current paper shows that
complex dynamics can also be found in commonly used mathematical models for neural networks. The dynamics can be classified by using the state transition diagram, which links the wiring diagram of the neural network to the coarse-grained activity patterns in the network. The simple structure of the mathematical equations enables us to demonstrate uniqueness and stability of limit cycle oscillations in some special circumstances. We comment briefly on the various dynamics found in these networks and then discuss some open theoretical questions. The extremal steady states in these networks are easily identified using the state transition diagram. Recent theoretical studies (Amit 1989) have linked such steady states with memories in neural networks, but we are not aware of physiological studies supporting such an identification. Neural network limit cycle oscillations have been proposed as models for rhythmogenesis in a large variety of invertebrate and vertebrate systems (Friesen and Stent 1978; Matsuoka 1985). These studies considered networks of a specific connectivity, and some analytical results have been obtained for the oscillatory properties of these systems (Matsuoka 1985; Cohen 1988). The current approach provides techniques for associating patterns of oscillation with the underlying connectivity of the network (Glass and Young 1979). A novel behavior demonstrated here is multistability of limit cycle oscillations, where parameter changes of the network can lead to changes in the stability of the various behaviors (Figs. 5 and 6). This behavior is interesting in light of recent experimental studies on multifunctional invertebrate neural networks (Harris-Warrick and Marder 1991; Meyrand et al. 1991), where different types of oscillatory behaviors can be exhibited by a single network. The simple networks here also support chaotic dynamics. Although the possible role of chaotic dynamics in normal and pathological functioning in neurobiology was raised several years ago (Guevara et al. 1983; Harth 1983), clear identification of chaos in neural systems has been possible only in rather simple systems in which there is a periodic forcing of neural tissue (Matsumoto et al. 1987; Takahashi et al. 1990). There have also been claims that neural activity in more complex situations is chaotic (Rapp et al. 1985; Skarda and Freeman 1987; Babloyantz and Destexhe 1987). The existence of chaotic dynamics in models of abstract neural networks has also been investigated. Kürten and Clark (1986) used spectral and dimensional analysis to identify chaos in a neural network model of 26 elements, each described by 2 ordinary differential equations and interconnected in a pseudorandom manner, with each element receiving 7 inputs (both excitatory and inhibitory). Sompolinsky et al. (1988) have shown that some continuous models of neural networks show a transition to chaotic dynamics as a gain parameter is varied. They proved this result in the thermodynamic limit (i.e., in an infinitely large network). Finally, Kepler et al. (1990) showed that for a specific formulation
of a neural network implemented as an electronic circuit, chaotic dynamics could be observed in three dimensions. Their investigation focused, however, on the dynamics of four-dimensional networks. A compelling question is to identify and classify network connectivities that are capable of generating chaotic dynamics. Several mathematical questions are raised by this work. Previously we reported that, assuming the same connection parameters for each element (i.e., w_ij = 1 and n_p inputs to each element), the lowest dimension in which chaotic dynamics was observed is 6 (Lewis and Glass 1991). However, when the w_ij are randomly chosen real numbers (with w_ii = 0), some networks of 5 elements have shown such behavior (less than 0.05% of networks tested). The general system, equation 2.7, has shown chaos in dimensions 4 and higher; in these cases the truth tables consisted of functions that do not correspond to those possible in neural network models. Preliminary studies of the prevalence of the various sorts of dynamic behavior have been carried out. For 2- and 3-input systems in dimensions up to 20, chaotic dynamics appear to be a relatively rare phenomenon, found in less than 1% of trials in which there were 20 initial conditions for each of 1000 different networks. The number of different attractor basins in these networks is also very small (usually less than 10 attractors, even in dimension 20). However, systematic numerical studies require searching in huge parameter spaces, since one is interested in studying the effects of the number of inputs, the thresholds, and the connectivity. The simplicity of numerically integrating the piecewise linear equations facilitates such studies. A difficult mathematical question is to analyze the bifurcations as the piecewise linear functions are replaced by continuous functions. Numerical results indicate that in systems with cyclic attractors, the limit cycles maintain stability over a large range of steepness of the sigmoidal function, but there is no proof of this (Glass 1977b). The bifurcations in more complex networks that display chaos require further analysis. An especially interesting question is how chaos arises in these systems, whose dynamics are dissipative within every coarse-grained orthant of phase space. This work provides a conceptually simple way to correlate the connectivity and dynamics of simple models of neural networks, and it provides a foundation for the investigation of more realistic models of neural networks and of complex rhythms observed in the laboratory.
Acknowledgments

This research has been supported by funds from the Natural Sciences and Engineering Research Council of Canada and the Fonds F.C.A.R. du Québec.
References

Amit, D. J. 1989. Modeling Brain Function: The World of Attractor Neural Networks. Cambridge University Press, Cambridge.
Babloyantz, A., and Destexhe, A. 1987. Chaos in neural networks. Proc. Int. Conf. Neural Networks, San Diego, CA, pp. 1-9.
Cohen, M. A. 1988. Sustained oscillations in a symmetric cooperative-competitive neural network: Disproof of a conjecture about content addressable memory. Neural Networks 1, 217-221.
Cowan, J. D., and Sharp, D. H. 1988. Neural nets. Q. Rev. Biophys. 21, 305-427.
Friesen, W. O., and Stent, G. S. 1978. Neural circuits for generating rhythmic movements. Annu. Rev. Biophys. Bioeng. 7, 37-61.
Glass, L. 1975. Combinatorial and topological methods in nonlinear chemical kinetics. J. Chem. Phys. 63, 1325-1335.
Glass, L. 1977a. Combinatorial aspects of dynamics in biological systems. In Statistical Mechanics and Statistical Methods in Theory and Application, U. Landman, ed., pp. 585-611. Plenum, New York.
Glass, L. 1977b. Global analysis of nonlinear chemical kinetics. In Statistical Mechanics, Pt. B, B. J. Berne, ed., pp. 311-349. Plenum, New York.
Glass, L., and Pasternack, J. S. 1978. Stable oscillations in mathematical models of biological control systems. J. Math. Biology 6, 207-223.
Glass, L., and Young, R. 1979. Structure and dynamics of neural network oscillators. Brain Res. 179, 207-218.
Guevara, M. R., Glass, L., Mackey, M. C., and Shrier, A. 1983. Chaos in neurobiology. IEEE Trans. Syst. Man Cybern. SMC-13, 790-798.
Harris-Warrick, R. M., and Marder, E. 1991. Modulation of neural networks for behavior. Annu. Rev. Neurosci. 14, 39-57.
Harth, E. 1983. Order and chaos in neural systems: Approach to the dynamics of higher brain functions. IEEE Trans. Syst. Man Cybern. SMC-13, 782-789.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Keister, W., Ritchie, A. E., and Washburn, S. H. 1951. The Design of Switching Circuits. D. Van Nostrand, Toronto.
Kepler, T. B., Datt, S., Meyer, R. B., and Abbott, L. F. 1990. Chaos in a neural network circuit. Physica D 46, 449-457.
Kürten, K. E., and Clark, J. W. 1986. Chaos in neural systems. Phys. Lett. A 114, 413-418.
Lasota, A., and Mackey, M. C. 1985. Probabilistic Properties of Deterministic Systems. Cambridge University Press, Cambridge.
Lewis, J. E. 1991. Dynamics of neural networks and respiratory rhythm generation. M.Sc. Thesis, McGill University.
Lewis, J. E., and Glass, L. 1991. Steady states, limit cycles, and chaos in models of complex biological networks. Int. J. Bifurc. Chaos 1, 477-483.
Matsumoto, G., Aihara, K., Hanyu, Y., Takahashi, N., Yoshizawa, S., and Nagumo, J. 1987. Chaos and phase locking in normal squid axons. Phys. Lett. A 123, 162-166.
Matsuoka, K. 1985. Sustained oscillations generated by mutually inhibiting neurons with adaptation. Biol. Cybern. 52, 367-376.
Meyrand, P., Simmers, J., and Moulins, M. 1991. Construction of a pattern generating circuit with neurons of different networks. Nature (London) 351, 60-63.
Rapp, P., Zimmerman, I. D., Albano, A. M., deGuzman, G. C., Greenbaum, N. N., and Bashore, T. R. 1985. Experimental studies of chaotic neural behavior: Cellular activity and electroencephalographic signals. In Nonlinear Oscillations in Biology and Chemistry, H. G. Othmer, ed., pp. 175-205. Springer-Verlag, Berlin.
Ruelle, D. 1989. Chaotic Evolution and Strange Attractors. Cambridge University Press, Cambridge.
Skarda, C. A., and Freeman, W. J. 1987. How brains make chaos in order to make sense of the world. Behav. Brain Sci. 10, 161-195.
Sompolinsky, H., Crisanti, A., and Sommers, H. J. 1988. Chaos in random neural networks. Phys. Rev. Lett. 61, 259-262.
Takahashi, N., Hanyu, Y., Musha, T., Kubo, R., and Matsumoto, G. 1990. Global bifurcation structure in periodically stimulated giant axons of squid. Physica D 43, 318-334.
Received 19 August 1991; accepted 3 January 1992.
NOTE
Communicated by Charles Stevens
Cortical Cells Should Fire Regularly, But Do Not

William R. Softky
Christof Koch
Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125 USA
When a typical nerve cell is injected with enough current, it fires a regular stream of action potentials. But cortical cells in vivo usually fire irregularly, reflecting synaptic input from presynaptic cells as well as intrinsic biophysical properties. We have applied the theory of stochastic processes to spike trains recorded from cortical neurons (Tuckwell 1989) and find a fundamental contradiction between the large interspike variability observed and the much lower values predicted by well-accepted biophysical models of single cells. Over 10,000 extracellular spike trains were recorded from cells in cortex of the awake macaque monkey responding to various visual stimuli. These trains were recorded from V1 (Knierim and Van Essen 1992) and MT (Newsome et al. 1989). Traces were chosen from well-isolated, fast-firing, nonbursting neurons. Because the firing frequency varied over the course of the stimulus presentation, each interspike interval Δt (i.s.i.) was assigned to 1 of 10 histograms for that cell, with each histogram representing a narrow range of instantaneous firing rates, for example, 50-100 Hz or 250-300 Hz. From each histogram we computed a measure of the variability of the spike train, the dimensionless coefficient of variation (CV), which is the ratio of the standard deviation to the mean of the i.s.i. histogram: CV = σ_Δt / ⟨Δt⟩.
The approximate CV values measured here are in good agreement with other reports of CV (Douglas and Martin 1991; Burns and Webb 1976): interspike intervals are near-random, close to that expected for the i.s.i. histogram of a pure Poisson process (i.e., CV ≈ 0.5-1; see Fig. 1). We attempted to account for this observed variability using a simple integrate-and-fire model requiring N random (Poisson) impulse inputs to reach threshold (Tuckwell 1989). For such a neuron, CV = 1/√N. An absolute refractory period t_0 reduces this value when the mean interspike interval is near t_0 (Tuckwell 1989).
Figure 1: Comparison of the randomness measure CV as a function of interspike interval for three different data sets: (1) experimentally recorded, nonbursting, macaque cortical neurons (MT and V1; empty squares; we observed no systematic difference between the two data sets); (2) detailed compartmental simulation of a reconstructed layer V pyramidal cell (filled, connected squares); (3) different integrate-and-fire models with a refractory period of 1.0 msec and N EPSPs required to fire (crosses and jagged lines). Crosses are predictions by integrate-and-fire models with N = 1 (top), N = 4 (middle), and N = 51 (bottom). Jagged lines show simulated leaky integrators with N = 51: τ_m = 0.2 msec (top) or τ_m = 13 msec (bottom). Conventional parameters (i.e., τ_m > 10 msec and N > 50) fail to account for the high variability observed.
Numerical simulations with a leak term τ_m = RC show that CV increases significantly only when the mean interspike interval is much greater than τ_m. CV can also increase during periods of very strong inhibition, but such inhibition was not found in a recent electrophysiological search (Berman et al. 1991). Because most researchers estimate that 100 or more inputs are required to trigger a cell (Douglas and Martin 1991; Abeles 1991), as well as τ_m ≥ 10 msec and
t_0 ≥ 1.0 msec, the above models predict that CV should be far lower than is seen in the monkey data for the high firing rates observed (see Fig. 1). There remains the possibility that more realistic Hodgkin-Huxley neurons (whose firing currents are continuous functions of voltage) might be able to amplify input irregularities more effectively than the highly simplified integrate-and-fire neuron above, which has a discontinuous firing threshold and no such sensitive voltage regime. We expect that this difference would be significant only in a neuron whose soma spends most of its "integration time" resting just below threshold (unlike the cortical cells in question, which have high firing rates and hence no stationary resting potential during periods of peak activation). But the only persuasive test would be the simulation of a Hodgkin-Huxley-like neuron in the presence of random synaptic input. We therefore simulated a biophysically very detailed compartmental model of an anatomically reconstructed and physiologically characterized layer V pyramidal cell (Bernander et al. 1991). The model included not only the traditional Hodgkin-Huxley currents, but also additional active currents at the cell body (I_Na, I_Na-p, I_Ca, I_DR, I_A, I_M, I_K(Ca)), 820 compartments, and a passive decay time of τ_m = 13 msec. Spatially distributed random (Poisson) excitatory synaptic conductance inputs gave rise to strong somatic EPSPs with mean amplitudes around 1.6 mV. We provided enough synaptic input to this cell to generate 200 spike trains (with mean frequencies comparable to the spike trains recorded from monkey) and subjected them to the same analysis. The resulting CV values agree with the simple integrator models, and disagree strongly with the monkey data (see Fig. 1). In addition, the number of spikes n in each simulated train varied by no more than a few percent, a much smaller amount than the √n variation observed for real cells. Therefore, we conclude that the present knowledge of pyramidal cell biophysics and dynamics is unable to account for the high CV seen in fast-firing monkey visual cortex neurons: these cells should fire regularly, but do not. Neither the data nor the model used here are controversial. But they are not consistent with each other. Only a few situations could cause near-random, fast firing in these monkey cells: for example, strong synaptic conductance changes that create a very fast effective time constant (τ_m < 0.2 msec; see Fig. 1), or nonrandom synaptic input that is highly synchronized on a millisecond scale (Abeles 1991; Koch and Schuster 1992). In the absence of such phenomena, the Central Limit Theorem makes these cells' observed near-random spiking inconsistent with their assumed role as devices that temporally integrate over many inputs. Thus, it may well be that the time scale of cortical computation is much faster than previously realized.
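The integrate-and-fire prediction CV = 1/√N is easy to reproduce numerically. In the sketch below, each interspike interval is the waiting time for N Poisson EPSPs, i.e., the sum of N independent exponential interarrival times (a gamma variate); the refractory period t_0 is included as a simple additive dead time, which is one common simplification rather than the exact model of Tuckwell (1989). The values N = 1, 4, and 51 match those in Figure 1.

    import numpy as np

    def isi_cv(n_inputs, rate=5000.0, t0=0.001, n_trials=100000, rng=None):
        """CV of interspike intervals for an integrate-and-fire cell that
        needs n_inputs Poisson EPSPs (arrival rate in Hz) to fire;
        t0 = dead time in seconds."""
        rng = rng or np.random.default_rng(0)
        # waiting time for n_inputs Poisson events = gamma(n_inputs, 1/rate)
        isi = t0 + rng.gamma(n_inputs, 1.0 / rate, size=n_trials)
        return isi.std() / isi.mean()

    for n in (1, 4, 51):
        # CV falls toward 1/sqrt(N); the dead time lowers it further
        print(n, isi_cv(n), 1.0 / np.sqrt(n))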
Acknowledgments

This research was funded by an NSF Presidential Young Investigator Award, by the Office of Naval Research, and by the James S. McDonnell Foundation.
References

Abeles, M. 1991. Corticonics. Cambridge University Press, New York.
Berman, N., Douglas, R., Martin, K., and Whitteridge, D. 1991. Mechanisms of inhibition in cat visual cortex. J. Physiol. 440, 697-722.
Bernander, O., Douglas, R., Martin, K., and Koch, C. 1991. Synaptic background activity determines spatio-temporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. U.S.A. 88, 11569-11573.
Burns, B., and Webb, A. C. 1976. The spontaneous activity of neurones in the cat's cerebral cortex. Proc. R. Soc. London B 194, 211-223.
Douglas, R., and Martin, K. 1991. Opening the grey box. Trends Neurosci. 14, 286-293.
Knierim, J., and Van Essen, D. 1992. Neuronal responses to static textural patterns in area V1 of the alert macaque monkey. J. Neurophysiol. 67, 961-980.
Koch, C., and Schuster, H. 1992. A simple network showing burst synchronization without frequency locking. Neural Comp. 4, 211-223.
Newsome, W., Britten, K., Movshon, J. A., and Shadlen, M. 1989. Single neurons and the perception of motion. In Neural Mechanisms of Visual Perception, D. Man-Kit Lam and C. Gilbert, eds., pp. 171-198. Portfolio Publishing Co., The Woodlands, TX.
Tuckwell, H. C. 1989. Stochastic Processes in the Neurosciences. Society for Industrial and Applied Mathematics, Philadelphia.
Received 29 October 1991; accepted 4 February 1992.
NOTE
Communicated by Andrew Barto
A Simplified Neural-Network Solution through Problem Decomposition: The Case of the Truck Backer-Upper

Robert E. Jenkins
Ben P. Yuhas*
The Applied Physics Laboratory, The Johns Hopkins University, Baltimore, MD 21218 USA

Nguyen and Widrow (1990) demonstrated that a feedforward neural network could be trained to steer a tractor-trailer truck to a dock while backing up. The feedforward network they used to control the truck contained 25 hidden units and required tens of thousands of training examples. The training strategy was to slowly expand the region in which the controller could operate, by starting with positions close to the dock and, after a few thousand iterations, moving the truck a little farther away. We found that a very simple solution exists requiring only two hidden units in the controller. The solution was found by decomposing the problem into subtasks. The original goal was to use the solutions to these subtasks to reduce training time. What we found was a complete solution. Nevertheless, this example demonstrates how building prior knowledge into the network can dramatically simplify the problem. The problem is composed of three subtasks. First, the truck must be oriented so that the trailer is nearly normal to the dock. This is accomplished by continuously driving θ_trailer to zero by tilting the cab in the proper direction. Then, having gotten θ_trailer to zero or near zero, the cab must be straightened out to keep it there. Thus a restoring spring constant on θ_trailer is needed to drive θ_trailer to 0, and a restoring spring constant on θ_cab is needed to straighten out the cab as θ_trailer approaches 0. This subnetwork depends upon the values of θ_trailer and θ_cab and is independent of position. Once the truck is correctly oriented, the remaining objective is to dock at Y = 0. An acceptable solution is found to be independent of X, as long as the truck is not started too close to the left edge. An X dependence could be introduced to amplify the movement to Y = 0 when the truck is closer to the dock. This X dependence is equivalent to turning up the gain on the transfer function, and would best be captured by a multiplicative control term (X times Y) using sigma-pi units. The truck and the controller are shown in Figure 1. The specific weights used were adjusted based on observed performance, balancing between sensitivity and damping. This controller was able to successfully

*Current address: Bellcore MRE 2E-330, Morristown, NJ 07962-1910 USA.
Figure 1: The truck and the network used to control it. The state of the truck is described by the X, Y coordinates of the back of the trailer along with three angles: the trailer relative to the dock, θ_trailer; the cab relative to the trailer, θ_cab; and the angle of the wheel relative to the cab, θ_wheel. The weights used do not constitute a unique solution. Increasing the input-to-hidden weights while maintaining their ratio (for correct stability) can be approximately compensated for by reducing the hidden-to-output weights, and vice versa.
back the truck up to the dock from all random locations we observed, as long as the back of the trailer started at least 0.7 times the trailer length away from the left wall. This example demonstrates how intuitively decomposing the problem can be used to initialize the neural network's weights. In this specific example, by identifying the components of the problem and embedding their solutions in the network, a solution to the larger problem was obtained.
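A sketch of such a two-hidden-unit controller is given below. The gains are illustrative placeholders chosen to express the two restoring-spring terms and the docking term described above, not the tuned weights used by the authors; angles are taken in radians and tanh units are assumed.

    import numpy as np

    def steer(theta_trailer, theta_cab, y):
        """Return a (scaled) wheel angle in [-1, 1] from the truck state.
        Hidden unit 1: orient the trailer (spring on theta_trailer,
        damped by theta_cab). Hidden unit 2: dock at Y = 0 once oriented."""
        k1, k2, k3 = 3.0, 1.5, 0.2        # hypothetical input gains
        h1 = np.tanh(k1 * theta_trailer + k2 * theta_cab)
        h2 = np.tanh(k3 * y)
        v1, v2 = 1.0, 0.5                 # hypothetical output weights
        return np.tanh(v1 * h1 + v2 * h2)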
References

Nguyen, D. H., and Widrow, B. 1990. Neural networks for self-learning control systems. IEEE Control Syst. Mag., 18-23.
Received 17 December 1991; accepted 4 March 1992.
Communicated by Christoph von der Malsburg
Learning to Segment Images Using Dynamic Feature Binding

Michael C. Mozer
Department of Computer Science and Institute of Cognitive Science, University of Colorado, Boulder, CO 80309-0430 USA
Richard S. Zemel Department of Computer Science, University of Toronto, Toronto, Ontario M5S 1A4
Marlene Behrmann
Department of Psychology and Faculty of Medicine and Rotman Research Institute of Baycrest Centre, University of Toronto, Toronto, Ontario M5S 1A1
Christopher K. I. Williams
Department of Computer Science, University of Toronto, Toronto, Ontario M5S 1A4
Despite the fact that complex visual scenes contain multiple, overlapping objects, people perform object recognition with ease and accuracy. One operation that facilitates recognition is an early segmentation process in which features of objects are grouped and labeled according to which object they belong. Current computational systems that perform this operation are based on predefined grouping heuristics. We describe a system called MAGIC that learns how to group features based on a set of presegmented examples. In many cases, MAGIC discovers grouping heuristics similar to those previously proposed, but it also has the capability of finding nonintuitive structural regularities in images. Grouping is performed by a relaxation network that attempts to dynamically bind related features. Features transmit a complex-valued signal (amplitude and phase) to one another; binding can thus be represented by phase locking related features. MAGIC's training procedure is a generalization of recurrent backpropagation to complex-valued units.

1 Introduction
Recognizing an isolated object in an image is a demanding computational task. The difficulty is greatly compounded when the image contains
multiple objects because image features are not grouped according to which object they belong. Without the capability to form such groupings, it would be necessary to undergo a massive search through all subsets of image features. For this reason, most machine vision recognition systems include a component that performs feature grouping or image segmentation (e.g., Guzman 1968; Lowe 1985; Marr 1982). Psychophysical and neuropsychological evidence suggests that the human visual system performs a similar operation (Duncan 1984; Farah 1990; Kahneman and Henik 1981; Treisman 1982). Image segmentation presents a circular problem: Objects cannot be identified until the image has been segmented, but unambiguous segmentation of the image requires knowledge of what objects are present. Fortunately, object recognition systems do not require precise segmentation: Simple heuristics can be used to group features, and although these heuristics are not infallible, they suffice for most recognition tasks. Further, the segmentation-recognition cycle can iterate, allowing the recognition system to propose refinements of the initial segmentation, which in turn refines the output of the recognition system (Hinton 1981; Hanson and Riseman 1978; Waltz 1975). A multitude of heuristics have been proposed for segmenting images. Gestalt psychologists have explored how people group elements of a display and have suggested a range of grouping principles that govern human perception. For example, there is evidence for the grouping of elements that are close together in space or time, that appear similar, that move together, or that form a closed figure (Rock and Palmer 1990). Computer vision researchers have studied the problem from a more computational perspective. They have investigated methods of grouping elements of an image based on nonaccidental regularities: feature combinations that are unlikely to occur by chance when several objects are juxtaposed, and are thus indicative of a single object. Kanade (1981) describes two such regularities, parallelism and skewed symmetry, and shows how finding instances of these regularities can constrain the possible interpretations of line drawings. Lowe and Binford (1982) find nonaccidental, significant groupings through a statistical analysis of images. They evaluate potential feature groupings with respect to a set of heuristics such as collinearity, proximity, and parallelism. The evaluation is based on a statistical measure of the likelihood that the grouping might have resulted from the random alignment of image features. Boldt et al. (1989) describe an algorithm for constructing lines from short line segments. The algorithm evaluates the goodness of fit of pairs of line segments in a small neighborhood based on relational measures (collinearity, proximity, and contrast similarity). Well-matched pairs are replaced by longer segments, and the procedure is repeated. In these earlier approaches, the researchers have hypothesized a set of grouping heuristics and then tested their psychological validity or computational utility. In our work, we have taken an adaptive approach to the problem of image segmentation in which a system learns how to group features based on a set of examples.
Figure 1: Examples of randomly generated two-dimensional geometric contours.

We call the system MAGIC, an acronym for multiple-object adaptive grouping of image components. In many cases MAGIC discovers grouping heuristics similar to those proposed in earlier work, but it also has the capability of finding nonintuitive structural regularities in images. MAGIC is trained on a set of presegmented images containing multiple objects. By "presegmented" we mean that each image feature is labeled as to which object it belongs. MAGIC learns to detect configurations of the image features that have a consistent labeling in relation to one another across the training examples. Identifying these configurations then allows MAGIC to label features in novel, unsegmented images in a manner consistent with the training examples.

2 The Domain
Our initial work has been conducted in the domain of two-dimensional geometric contours, including rectangles, diamonds, crosses, triangles, hexagons, and octagons. The contours are constructed from four primitive feature types (oriented line segments at 0°, 45°, 90°, and 135°) and are laid out on a 25 x 25 grid. At each location on the grid are units, called feature units, that represent each of the four primitive feature types. In our present experiments, images contain two contours. We exclude images in which the two contours share a common edge. This permits a unique labeling of each feature. Examples of several randomly generated images containing rectangles and diamonds are shown in Figure 1.
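Concretely, the input representation can be thought of as a 25 x 25 x 4 array of binary feature units, one unit per grid location and orientation. A minimal sketch follows; the helper that marks the horizontal segments of one row is a hypothetical illustration, not part of the paper's code.

    import numpy as np

    # feature-unit array: 25 x 25 grid, 4 orientations (0, 45, 90, 135 degrees)
    features = np.zeros((25, 25, 4), dtype=bool)

    ORIENT = {0: 0, 45: 1, 90: 2, 135: 3}

    def draw_horizontal_edge(row, col_start, col_end):
        # hypothetical helper: mark 0-degree segments along part of a row,
        # as would occur for the top or bottom edge of a rectangle
        features[row, col_start:col_end, ORIENT[0]] = True

    draw_horizontal_edge(5, 3, 12)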
3 Representing Feature Labelings

Before describing MAGIC, we must first discuss a representation that allows for the labeling of features. von der Malsburg (1981; von der
Malsburg and Schneider 1986), Gray et al. (1989), Eckhorn et al. (1988), and Strong and Whitehead (1989), among others, have suggested a biologically plausible mechanism of labeling through temporal correlations among neural signals, either the relative timing of neuronal spikes or the synchronization of oscillatory activities in the nervous system. The key idea here is that each processing unit conveys not just an activation value (average firing frequency in neural terms) but also a second, independent value that represents the relative phase of firing. The dynamic grouping or binding of a set of features is accomplished by aligning the phases of the features. A flurry of recent work on populations of coupled oscillators (e.g., Baldi and Meir 1990; Grossberg and Somers 1991; Eckhorn et al. 1990; Kammen et al. 1990) has shown that this type of binding can be achieved using simple dynamic rules. However, most of this work assumes a relatively homogeneous pattern of connectivity among the oscillators and has not attempted to tackle problems in computer vision such as image segmentation, where each oscillator represents an image feature and more selective connections between the oscillators are needed to simulate the selective binding of appropriate subsets of image features. A few exceptions exist (Goebel 1991a,b; Hummel and Biederman 1992; Lumer and Huberman 1991; Sporns et al. 1991); in these systems, the pattern of connectivity among oscillators is specified by simple predetermined grouping heuristics.¹ In MAGIC, the activity of a feature unit is a complex value with amplitude and phase components. The phase represents a labeling of the feature, and the amplitude represents the confidence in that labeling. The amplitude ranges from 0 to 1, with 0 indicating a complete lack of confidence and 1 indicating absolute certainty. There is no explicit representation of whether a feature is present or absent in an image. Rather, absent features are clamped off (their amplitudes are forced to remain at 0), which eliminates their ability to influence other units, as will become clear when the activation dynamics are presented later.
4 The Architecture

When an image is presented to MAGIC, units representing features absent in the image are clamped off and units representing present features are assigned random initial phases and small amplitudes. MAGIC's task is to assign appropriate phase values to the units. Thus, the network performs a type of pattern completion.

¹In the Sporns et al. model, the coupling strength between two connected units changes dynamically on a fast time scale, but this adaptation is related to achieving temporal correlations, not learning grouping principles.
Figure 2: The architecture of MAGIC. The lower (input) layer contains the feature units; the upper layer contains the hidden units. Each layer is arranged in a spatiotopic array with a number of different feature types at each position in the array. Each plane in the feature layer corresponds to a different feature type. The grayed hidden units are reciprocally connected to all features in the corresponding grayed region of the feature layer. The lines between layers represent projections in both directions.
The network architecture consists of two layers of units, as shown in Figure 2. The lower (input) layer contains the feature units, arranged in spatiotopic arrays with one array per feature type. The upper layer contains hidden units that help to align the phases of the feature units; their response properties are determined by training. There are interlayer connections, but no intralayer connections. Each hidden unit is reciprocally connected to the units in a local spatial region of all feature arrays. We refer to this region as a patch; in our current simulations, the patch has dimensions 4 x 4. For each patch there is a corresponding fixed-size pool of hidden units. To achieve uniformity of response across the image, the pools are arranged in a spatiotopic array in which neighboring pools respond to neighboring patches and the patch-to-pool weights are constrained to be the same at all locations in the array. The feature units activate the hidden units, which in turn feed back to the feature units. Through a relaxation process, the system settles on an assignment of phases to the features. One might consider an alternative architecture in which feature units were directly connected to one another (Hummel and Biederman 1992). However, this architecture is in principle not as powerful as the one we propose because it does not allow for higher order contingencies among features.
5 Network Dynamics
The dynamics of MAGIC are based on a mean-field approximation to a stochastic network of directional units, described in Zemel et al. (1992). A variant of this model was independently developed by Gislén et al. (1991). These papers provide a justification of the activation rule and error function in terms of an energy minimization formalism. The response of each feature unit i, x_i, is a complex value in polar form, (a_i, p_i), where a_i is the amplitude and p_i is the phase. Similarly, the response of each hidden unit j, y_j, has components (b_j, q_j). The weight connecting unit i to unit j, w_ij, is also complex valued, having components (ρ_ij, θ_ij). The activation rule we propose is a generalization of the dot product to the complex domain. The net input to hidden unit j at time step t + 1 is

net_j(t+1) = x(t) · w_j = Σ_i x_i(t) w*_ij,

whose amplitude is

( [Σ_i a_i(t) ρ_ij cos(p_i(t) - θ_ij)]² + [Σ_i a_i(t) ρ_ij sin(p_i(t) - θ_ij)]² )^(1/2),

where the asterisk denotes the complex conjugate. The net input is passed through a squashing nonlinearity that maps the amplitude of the response from the range 0 → ∞ into 0 → 1 but leaves the phase unaffected:

y_j(t) = [net_j(t) / m_j(t)] · I_1[m_j(t)] / I_0[m_j(t)],
where m_j(t) is the magnitude of the net input, |net_j(t)|, and I_k is the modified Bessel function of the first kind and order k. The squashing function I_1(m)/I_0(m) is shown in Figure 3. The intuition underlying the activation rule is as follows. The amplitude (confidence) of a hidden unit, b_j, should be monotonically related to how well the feature response pattern matches the hidden unit weight vector, just as in the standard real-valued activation rule. Indeed, one can readily see that if the feature and weight phases are equal (p_i = θ_ij), the rule for b_j reduces to the real-valued case. Even if the feature and weight phases differ by a constant (p_i = θ_ij + c), b_j is unaffected. This is a critical property of the activation rule: Because absolute phase values have no intrinsic meaning, the response of a unit should depend only on the relative phases. That is, its response should be rotation invariant. The activation rule achieves this by essentially ignoring the average difference
Figure 3: The squashing function G = I_1(m)/I_0(m). The amplitude of the net input to a unit is passed through this function to obtain the output amplitude.
in phase between the feature units and the weights. The hidden phase, q_j, reflects this average difference.²

The flow of activation from the hidden layer to the feature layer follows the same dynamics as the flow from the feature layer to the hidden layer:

net_i(t+1) = y(t+1) · w_i

and

x_i(t+1) = [net_i(t+1) / m_i(t+1)] · I_1[m_i(t+1)] / I_0[m_i(t+1)]

if feature i is present in the image, or x_i(t+1) = 0 otherwise. Note that the update is sequential by layer: the feature units activate the hidden units, which then activate the feature units. In MAGIC, the weight matrix is constrained to be Hermitian, i.e., w_ij = w*_ji. This is a generalization of weight symmetry to the complex domain. Weight symmetry ensures that MAGIC will converge to a fixed point. The proof of this is a generalization of Hopfield's (1984) result to complex units, discrete-time update, and a two-layer architecture with sequential layer updates and no intralayer connections.
²To elaborate, the activation rule produces a q_j that yields the minimum of the following expression:

d_j = Σ_i { [a_i cos p_i - ρ_ij cos(θ_ij + q_j)]² + [a_i sin p_i - ρ_ij sin(θ_ij + q_j)]² }

This is a measure of the distance between the feature and weight vectors given a free parameter q_j that specifies a global phase shift of the weight vector.
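A sketch of this activation rule follows, with feature activities and weights stored as complex numbers; i0 and i1 are the order-0 and order-1 modified Bessel functions from scipy.special. The array shapes are assumptions of this sketch, not the paper's implementation.

    import numpy as np
    from scipy.special import i0, i1

    def squash(net):
        """Bessel-ratio squashing: amplitude |net| -> I1(|net|)/I0(|net|),
        phase unchanged. (For very large |net|, the scaled functions
        i1e/i0e would be numerically safer; the ratio is identical.)"""
        m = np.abs(net)
        gain = i1(m) / i0(m)                 # lies in [0, 1)
        safe_m = np.where(m > 0, m, 1.0)     # avoid division by zero
        return gain * net / safe_m

    def hidden_response(x, W):
        """x: complex feature vector (n_features,);
        W[j, i] = w_ij: complex weights (n_hidden, n_features).
        net_j = sum_i x_i * conj(w_ij)."""
        return squash(W.conj() @ x)

    def feature_response(y, W, present):
        """Top-down pass. With the Hermitian constraint w_ji = conj(w_ij),
        net_i = sum_j y_j * conj(w_ji) = sum_j y_j * w_ij."""
        x = squash(y @ W)
        return np.where(present, x, 0)       # absent features stay clamped off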
6 Learning Algorithm

During training, we would like the hidden units to learn to detect configurations of features that reliably indicate phase relationships among the features. For instance, if the contours in the image contain extended horizontal lines, one hidden unit might learn to respond to a collinear arrangement of horizontal segments. Because the unit's response depends on the phase pattern as well as the activity pattern, it will be strongest if the segments all have the same phase value. We have experimented with a variety of algorithms for training MAGIC, including an extension of soft competitive learning (Nowlan 1990) to complex-valued units, recurrent backpropagation (Almeida 1987; Pineda 1987), backpropagation through time (Rumelhart et al. 1986), a backpropagation autoencoder paradigm in which patches of the image are processed independently, and an autoencoder in which the patches are processed simultaneously and their results are combined. The algorithm with which we have had greatest success, however, is a relatively simple single-step error propagation algorithm. It involves running the network for a fixed number of iterations and, for each iteration, using backpropagation to adjust the weights so that the feature phase pattern better matches a target phase pattern. Each training trial proceeds as follows:

1. A training example is generated at random. This involves selecting two contours and instantiating them in an image. The features of one contour have target phase 0° and the features of the other contour have target phase 180°.
2. The training example is presented to MAGIC by setting the initial amplitude of a feature unit to 0.1 if its corresponding image feature is present, or clamping it at 0.0 otherwise. The phases of the feature units are set to random values in the range 0° to 360°.

3. Activity is allowed to flow from the feature units to the hidden units and back to the feature units.
4. The new phase pattern over the feature units is compared to the target phase pattern (see step 1), and an error measure E is computed as a function of the magnitude m_i of the net input to each feature unit i, the actual phase p_i of unit i, and the target phase p̂_i. This is a log likelihood error function derived from the formalism described in Zemel et al. (1992). In this formalism, the activities of units represent a probability distribution over phase values. The error function is
the asymmetric divergence between the actual and target phase distributions. The aim is to minimize the difference between the target and actual phases and to maximize the amplitude, or confidence, of the response. The error measure factors out the absolute difference between the target and actual phases. That is, E is minimized when p̂_i - p_i is equal for all i, regardless of the common value of p̂_i - p_i.

5. Using a generalization of backpropagation to complex-valued units, error gradients are computed for the feature-to-hidden and hidden-to-feature weights.
6. Steps 3-5 are repeated for a maximum of 30 iterations. The trial is terminated if the error increases on five consecutive iterations.

7. Weights are updated by an amount proportional to the average error gradient over iterations. The constraint that w_ij = w*_ji is enforced by modifying w_ij in proportion to ∇_ij + ∇*_ji and modifying w_ji in proportion to ∇_ji + ∇*_ij, where ∇_ij denotes the gradient with respect to the weight to i from j. To achieve a translation-invariant response of the hidden units, hidden units of the same "type" responding to different regions of the image are constrained to have the same weights. This is achieved by having a single set of underlying weight parameters that is replicated across the hidden layer. The appropriate gradient descent algorithm for these parameters is to adjust them in proportion to the sum of the gradients with respect to each of their instantiations.
The algorithm is far less successful when a target phase pattern is given just on the final iteration or final k iterations, rather than on each iteration. Surprisingly, the algorithm operates little better when error signals are propagated back through time. The simulations reported below use a learning rate parameter of 0.005 for the amplitudes and 0.02 for the phases. On the order of 10,000 learning trials are required for stable performance, although MAGIC rapidly picks up on the most salient aspects of the domain.
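A sketch of the symmetry-preserving weight update of step 7 follows. It assumes the feature-to-hidden weights of one pool are stored as a complex matrix V with V[j, i] = w_ij, that the hidden-to-feature weights are represented implicitly as conj(V) (the Hermitian constraint), and that grads_fh and grads_hf are the per-patch complex gradient matrices produced by the backpropagation pass (not shown here).

    import numpy as np

    def update_shared_weights(V, grads_fh, grads_hf, lr=0.01):
        """One symmetry-preserving step for a pool's underlying parameters.
        grads_fh / grads_hf: lists of complex gradient matrices, one per
        replicated patch location (weight sharing: instantiation gradients
        are summed, as in the text)."""
        # adding each direction's gradient to the conjugate transpose of
        # the other keeps the pair (w_ij, w_ji = conj(w_ij)) Hermitian
        G = sum(grads_fh) + sum(g.conj().T for g in grads_hf)
        # note: the paper uses separate learning rates for amplitudes
        # (0.005) and phases (0.02); a single complex step is used here
        return V - lr * G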
We trained a network with 20 hidden units per pool on examples like those shown in Figure 1. The resulting weights are shown in Figure 4. Each hidden unit attempts to detect and reinstantiate activity patterns that match its weights. One clear and prevalent pattern in the weights is the collinear arrangement of segments of a given orientation, all having the same phase value. When a hidden unit having weights of this form responds to a patch of the feature array, it tries to align the phases of the patch with the phases of its weight vector.
Figure 4: Complex feature-to-hidden connection weights learned by MAGIC. In this simulation, there are connections from a 4 x 4 patch of the image to a pool of 20 hidden units. (These connections are replicated for each patch in the image to achieve a uniformity of hidden unit response.) The connections feeding into each hidden unit are presented on a light gray background. Each hidden unit has a total of 64 incoming weights: 4 x 4 locations in its receptive field and four feature types at each location. The weights are further grouped by feature type (dark gray background), and for each feature type they are arranged in a 4 x 4 pattern homologous to the image patch itself. The area of a circle is proportional to the amplitude of the corresponding weight; the orientation of the internal tick mark represents the phase angle. Due to the symmetry constraint, hidden-to-feature weights (not shown) mirror the feature-to-hidden weights.
[Figure 5 panels: the feature array at iterations 6, 10, and 25 of the relaxation process.]
Figure 5: An example of MAGIC segmenting an image. The "iteration" refers to the number of times activity has flowed from the feature units to the hidden units and back. The phase value of a feature is represented by a gray level. The cyclic phase continuum can be approximated only by a linear gray level continuum, but the basic information is conveyed nonetheless.

By synchronizing the phases of features, it acts to group the features. Thus, one can interpret the weight vectors as the rules by which features are grouped. Whereas traditional grouping principles indicate the conditions under which features should be bound together as part of the same object, the grouping principles learned by MAGIC also indicate when features should be segregated into different objects. For example, the weights of the vertical and horizontal segments are generally 180° out of phase with the diagonal segments. This allows MAGIC to segregate the vertical and horizontal features of a rectangle from the diagonal features of a diamond (see Fig. 1, left panel). We had anticipated that the weights to each hidden unit would contain two phase values at most, because each image patch contains at most two objects. However, some units make use of three or more phases, suggesting that the hidden unit is performing several distinct functions. As is the usual case with hidden unit weights, these patterns are difficult to interpret. Figure 5 presents an example of the network segmenting an image. The image contains two rectangles. The top left panel shows the features
of the rectangles and their initial random phases. The succeeding panels show the network's response during the relaxation process. The lower right panel shows the network response at equilibrium. Features of each object have been assigned a uniform phase, and the two objects are 180° out of phase. The task here may appear simple, but it is quite challenging due to the illusory rectangle generated by the overlapping rectangles.

8 Alternative Representation of Feature Labeling
To perform the image segmentation task, each feature unit needs to maintain two independent pieces of information: a label assigned to the feature and a measure of confidence associated with the label. In MAGIC, these two quantities are encoded by the phase and amplitude of a unit, respectively. This polar representation is just one of many possible encodings, and requires some justification due to the complexity of the resulting network dynamics. An alternative we have considered, which seems promising at first glance but has serious drawbacks, is the rectangular coordinate analog of the polar representation. In this scheme, a feature unit conveys values indicating belief in the hypotheses that the feature is part of object A or object B, where A and B are arbitrary names. For example, the activities (1, 0) and (0, 1) indicate complete confidence that the feature belongs to object A or B, respectively; (0, 0) indicates that nothing is known about which object the feature belongs to; and intermediate values indicate intermediate degrees of confidence in the two hypotheses. The rectangular and polar representations are equivalent in the sense that one can be transformed into the other.³ The rectangular scheme has two primary benefits. First, the activation dynamics are simpler. Second, it allows for the simultaneous and explicit consideration of multiple labeling hypotheses, whereas the polar scheme allows for the consideration of only one label at a time. However, these benefits are obtained at the expense of presuming a correspondence between absolute phase values and objects. (In the rectangular scheme we described, A and B always have phases 0° and 90°, respectively, obtained by transforming the rectangular coordinates to polar coordinates.) The key drawback of absolute phase values is that a local patch of the image cannot possibly determine which label is correct. A patch containing, say, several collinear horizontal segments can determine only that the segments should be assigned the same label. Preliminary simulations indicate that the resulting ambiguity causes severe difficulties in processing. In contrast, the polar scheme allows the network to express the relative labelings of two segments (e.g., that they should be assigned the same label) without needing to specify the particular label.

³Yann Le Cun (personal communication, 1991) has independently developed the notion of using the rectangular encoding scheme in the domain of adaptive image segmentation.
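The equivalence of the two encodings can be made explicit with a short sketch; the convention that object A sits at phase 0° and object B at 90° follows the description above, but the function names and the code itself are only illustrative.

```python
# Converting between the polar (amplitude, phase) and rectangular
# (belief-in-A, belief-in-B) label encodings discussed above.
import numpy as np

def polar_to_rect(amplitude, phase):
    # object A is mapped to phase 0 degrees, object B to phase 90 degrees
    return amplitude * np.cos(phase), amplitude * np.sin(phase)

def rect_to_polar(a, b):
    # confidence is the vector length; the label is its direction
    return np.hypot(a, b), np.arctan2(b, a)
```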
9 Current Directions
We are currently extending MAGIC in several directions, which we outline here.

• We have not addressed the question of how the continuous phase representation is transformed into a discrete object label. One may simply quantize the phase values such that all phases in a given range are assigned the same label (see the sketch after this list). This quantization step has the extremely interesting property that it allows for a hierarchical decomposition of objects. If the quantization is coarse, only gross phase differences matter, allowing one object to be distinguished from another. As the quantization becomes finer, an object is divided into its components. Thus, the quantization level in effect specifies whether the image is parsed into objects, parts of objects, parts of parts of objects, etc. This hierarchical decomposition of objects can be achieved only if the phase values reflect the internal structure of an object. For example, in the domain of geometric contours, MAGIC would not only have to assign one contour a different phase value than another, but it would also have to assign each edge composing a contour a slightly different phase than each other edge (assuming that one considers the edges to be the "parts" of the contour). Somewhat surprisingly, MAGIC does exactly this, because the linkage between segments of an edge is stronger than the linkage between two edges. This is due to the fact that collinear features occur in images with much higher frequency than do corners. Thus, the relative frequency of feature configurations leads to a natural principle for the hierarchical decomposition of objects.

• Although MAGIC is trained on pairs of objects, it has the potential of processing more than two objects at a time. For example, with three overlapping objects, MAGIC attempts to push each pair 180° out of phase but ends up with a best constraint-satisfaction solution in which each object is 120° out of phase with the others. We are exploring the limits of how many objects MAGIC can process at a time.

• Spatially local grouping principles are unlikely to be sufficient for the image segmentation task. Indeed, we have encountered incorrect solutions produced by MAGIC that are locally consistent but globally inconsistent. To solve this problem, we are investigating an architecture in which the image is processed at several spatial scales simultaneously. Fine-scale detectors respond to the sort of detail shown in Figure 4, while coarser-scale detectors respond to more global structure but with less spatial resolution.
• Simulations are under way to examine MAGIC's performance on real-world images (overlapping handwritten letters and digits), where it is somewhat less clear to which types of patterns the hidden units should respond.
• Behrmann et al. (1992) are conducting psychological experiments to examine whether limitations of the model match human limitations.
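As a minimal sketch of the quantization step mentioned in the first item above: the number of bins controls whether the parse is into objects or into parts. The function name and binning convention are illustrative.

```python
# Quantizing continuous phases into discrete object labels.
import numpy as np

def phase_labels(phases, n_bins):
    """Map phases in [0, 2*pi) to bin indices; coarse bins separate
    objects, finer bins split an object into its parts."""
    return np.floor(phases * n_bins / (2 * np.pi)).astype(int) % n_bins
```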
Acknowledgments

This research was supported by NSF Presidential Young Investigator award IRI-9058450, Grant 90-21 from the James S. McDonnell Foundation, and DEC external research Grant 1250 to MM; by a Natural Sciences and Engineering Research Council Postgraduate Scholarship to RZ; and by an NSERC operating grant to MB. Our thanks to Paul Smolensky, Radford Neal, Geoffrey Hinton, and Jürgen Schmidhuber for helpful comments regarding this work.
References

Almeida, L. 1987. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the IEEE First Annual International Conference on Neural Networks, Vol. 2, M. Caudill and C. Butler, eds., pp. 609-618. IEEE Publishing Services, San Diego, CA.
Baldi, P., and Meir, R. 1990. Computing with arrays of coupled oscillators: An application to preattentive texture discrimination. Neural Comp. 2, 458-471.
Behrmann, M., Zemel, R. S., and Mozer, M. C. 1992. Perceptual organization and object-based attention. Manuscript in preparation.
Boldt, M., Weiss, R., and Riseman, E. 1989. Token-based extraction of straight lines. IEEE Trans. Syst. Man Cybern. 19, 1581-1594.
Duncan, J. 1984. Selective attention and the organization of visual information. J. Exp. Psychol. General 113, 501-517.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Eckhorn, R., Reitboeck, H. J., Arndt, M., and Dicke, P. 1990. Feature linking via synchronization among distributed assemblies: Simulations of results from cat visual cortex. Neural Comp. 2, 293-307.
Farah, M. J. 1990. Visual Agnosia. The MIT Press/Bradford Books, Cambridge, MA.
Gislen, L., Peterson, C., and Soderberg, B. 1991. Rotor Neurons: Basic Formalism and Dynamics (LU TP 91-21). University of Lund, Department of Theoretical Physics, Lund, Sweden.
Goebel, R. 1991a. An oscillatory neural network model of visual attention, pattern recognition, and response generation. Manuscript in preparation.
Goebel, R. 1991b. The role of attention and short-term memory for symbol manipulation: A neural network model that learns to evaluate simple LISP expressions. In Cognition and Computer Programming, K. F. Wender, F. Schmalhofer, and H. D. Boecker, eds. Ablex Publishing Corporation, Norwood, NJ.
Gray, C. M., Koenig, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit intercolumnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Grossberg, S., and Somers, D. 1991. Synchronized oscillations during cooperative feature linking in a cortical model of visual perception. Neural Networks 4, 453-466.
Guzman, A. 1968. Decomposition of a visual scene into three-dimensional bodies. AFIPS Fall Joint Comput. Conf. 33, 291-304.
Hanson, A. R., and Riseman, E. M. 1978. Computer Vision Systems. Academic Press, New York.
Hinton, G. E. 1981. A parallel computation that assigns canonical object-based frames of reference. In Proceedings of the Seventh International Joint Conference on Artificial Intelligence, pp. 683-685. Morgan Kaufmann, Los Altos, CA.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Hummel, J. E., and Biederman, I. 1992. Dynamic binding in a neural network for shape recognition. Psychol. Rev., in press.
Kahneman, D., and Henik, A. 1981. Perceptual organization and attention. In Perceptual Organization, M. Kubovy and J. R. Pomerantz, eds., pp. 181-211. Erlbaum, Hillsdale, NJ.
Kammen, D., Koch, C., and Holmes, P. J. 1990. Collective oscillations in the visual cortex. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 76-83. Morgan Kaufmann, San Mateo, CA.
Kanade, T. 1981. Recovery of the three-dimensional shape of an object from a single view. Artificial Intell. 17, 409-460.
Lowe, D. G. 1985. Perceptual Organization and Visual Recognition. Kluwer Academic Publishers, Boston.
Lowe, D. G., and Binford, T. O. 1982. Segmentation and aggregation: An approach to figure-ground phenomena. In Proceedings of the DARPA IUS Workshop, pp. 168-178. Palo Alto, CA.
Lumer, E., and Huberman, B. A. 1992. Binding hierarchies: A basis for dynamic perceptual grouping. Neural Comp. 4, 341-355.
Marr, D. 1982. Vision. Freeman, San Francisco.
Nowlan, S. J. 1990. Maximum likelihood competition in RBF networks. Tech. Rep. CRG-TR-90-2. University of Toronto, Department of Computer Science, Connectionist Research Group, Toronto, Canada.
Pineda, F. 1987. Generalization of back propagation to recurrent neural networks. Phys. Rev. Lett. 59, 2229-2232.
Rock, I., and Palmer, S. E. 1990. The legacy of Gestalt psychology. Sci. Amer. 263, 84-90.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume I: Foundations, D. E. Rumelhart and J. L. McClelland, eds., pp. 318-362. The MIT Press/Bradford Books, Cambridge, MA.
Sporns, O., Tononi, G., and Edelman, G. M. 1991. Modeling perceptual grouping and figure-ground segregation by means of active reentrant connections. Proc. Natl. Acad. Sci. U.S.A. 88, 129-133.
Strong, G. W., and Whitehead, B. A. 1989. A solution to the tag-assignment problem for neural networks. Behav. Brain Sci. 12, 381-433.
Treisman, A. 1982. Perceptual grouping and attention in visual search for features and objects. J. Exp. Psychol. Human Percept. Perform. 8, 194-214.
von der Malsburg, C. 1981. The correlation theory of brain function. Internal Report 81-2, Max Planck Institute for Biophysical Chemistry, Department of Neurobiology, Goettingen, Germany.
von der Malsburg, C., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybern. 54, 29-40.
Waltz, D. A. 1975. Generating semantic descriptions from drawings of scenes with shadows. In The Psychology of Computer Vision, P. H. Winston, ed., pp. 19-92. McGraw-Hill, New York.
Zemel, R. S., Williams, C. K. I., and Mozer, M. C. 1992. Adaptive networks of directional units. Submitted for publication.
Received 29 August 1991; accepted 25 February 1992.
Communicated by Haim Sompolinsky
Stimulus-Dependent Assembly Formation of Oscillatory Responses: III. Learning

Peter Konig, Bernd Janosch, Thomas B. Schillen
Max-Planck-Institut für Hirnforschung, Deutschordenstrasse 46, 6000 Frankfurt 71, Germany
A temporal structure of neuronal activity has been suggested as a potential mechanism for defining cell assemblies in the brain. This concept has recently gained support by the observation of stimulus-dependent oscillatory activity in the visual cortex of the cat. Furthermore, experimental evidence has been found showing the formation and segregation of synchronously oscillating cell assemblies in response to various stimulus conditions. In previous work, we have demonstrated that a network of neuronal oscillators coupled by synchronizing and desynchronizing delay connections can exhibit a temporal structure of responses which closely resembles experimental observations. In this paper, we investigate the self-organization of synchronizing and desynchronizing coupling connections by local learning rules. Based on recent experimental observations, we modify synchronizing connections according to a two-threshold learning rule, involving synaptic potentiation and depression. This rule is generalized to its functional inverse for weight changes of desynchronizing connections. We show that after training, the resulting network exhibits stimulus-dependent formation and segregation of oscillatory assemblies in agreement with the experimental data. These results indicate that local learning rules during ontogenesis can suffice to develop a connectivity pattern in support of the observed temporal structure of stimulus responses in cat visual cortex.
1 Introduction

During recent years, the temporal structure of neuronal activity has attracted much interest due to its potential role in visual processing (von der Malsburg 1981; Singer 1990). Based on theoretical considerations, it has been proposed that the temporal correlation of neuronal responses could be used by the mammalian brain to solve the binding problem (von der Malsburg 1981). In particular, the synchronization of oscillatory responses would allow the unique definition of neuronal assemblies representing sensory stimuli (von der Malsburg and Schneider 1986).

Neural Computation 4, 666-681 (1992) © 1992 Massachusetts Institute of Technology
The presence of stimulus-driven oscillations of neuronal activity has been found in the olfactory bulb of the rabbit and in the cat visual cortex (Freeman 1975; Gray and Singer 1987, 1989; Eckhorn et al. 1988). Furthermore, both the stimulus-dependent synchronization and desynchronization of oscillatory neuronal responses have recently been demonstrated: light bars moving collinearly were shown to induce synchronous activity of neuronal responses (Gray et al. 1989). In contrast, two superimposed light bars moving in different directions activated two synchronously oscillating cell assemblies that exhibited no constant phase relationship with one another (Engel et al. 1991). Previously, we presented a network model of coupled delayed nonlinear oscillators that exhibits stimulus-dependent assembly formation of oscillatory responses in close analogy to experimental observations (Konig and Schillen 1990, 1991; Schillen and Konig 1990, 1991). The network incorporates two types of delay connections that actively synchronize and desynchronize oscillatory assemblies. As discussed in Schillen and Konig (1991), we consider the stimulus-dependent active desynchronization to be an important complement to synchronizing mechanisms if the temporal structure of neuronal activity is to be employed in sensory processing. In this paper, we investigate the self-organization of synchronizing and desynchronizing delay connections by local learning rules. Motivated by recent experimental observations in rat visual cortex (Artola et al. 1990), we modify synchronizing connections according to a two-threshold learning rule, involving synaptic potentiation and depression. We generalize this modification rule to its functional inverse for weight changes of the desynchronizing connections. We show that the resulting network exhibits stimulus-dependent assembly formation of oscillatory responses similar to those found in physiological experiments (Engel et al. 1991).

2 Oscillatory Network Model
We investigate the temporal structure of neuronal activity in a network of coupled delayed nonlinear oscillators (Konig and Schillen 1991). Elementary oscillators consist of an excitatory unit u_e, coupled with delay τ_ie to an inhibitory unit u_i, which projects back to unit u_e with delay τ_ei (Fig. 1A). When external stimulus input is applied to the excitatory unit u_e, this system exhibits a stimulus-dependent transfer from a stable fixed point to a limit cycle oscillation. For details of the dynamics of the system refer to Konig and Schillen (1991). Oscillatory elements are coupled by synchronizing and desynchronizing delay connections as described before: Synchronizing connections
Figure 1: (A) Basic oscillatory element implemented by coupling an excitatory unit with an inhibitory unit using delay connections. An additional unit allows for external input of a stimulus. t, time; x(t), unit activity; F(x), sigmoidal output function with threshold θ; w, coupling weight; τ, delay time; i_e(t), external input. Subscripts: e, excitatory unit; i, inhibitory unit. For further details see Konig and Schillen (1991). (B) Oscillatory elements coupled by two types of delay connections. Connections between the excitatory unit of one oscillator and the inhibitory unit of another (w_ie^(r), dashed) synchronize the activity of the respective oscillators (Konig and Schillen 1991). Connections between excitatory units (w_ee^(r), dotted) are used to desynchronize oscillatory activity (Schillen and Konig 1991). Note that the panel is meant to demonstrate only the employed types of coupling connections and not the model's connectivity pattern. Notation: Throughout this paper, w^(r) denotes the coupling weight between oscillators which are r oscillator positions apart (r-nearest-neighbor coupling).
(w_ie^(r))
originate from the excitatory unit of one oscillator and terminate at the inhibitory unit of another (Fig. 1B, dashed) (Konig and Schillen 1991). Couplings between the excitatory unit of one oscillator and the excitatory unit of another (w_ee^(r)) are used to achieve desynchronization of oscillatory activity (Fig. 1B, dotted) (Schillen and Konig 1991). We choose the configuration of the experiment by Engel et al. (1991) for our investigation of self-organization in an oscillatory network. In this experiment, several cells with receptive fields of different preferred orientations at closely overlapping retinal locations were stimulated with either a single or two overlapping moving light bars (Fig. 4B). Using these two stimulus conditions, both synchronization and desynchronization of oscillatory neuronal responses could be observed.
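A minimal numerical sketch of the elementary oscillator of Figure 1A follows: an excitatory and an inhibitory unit coupled through delay lines and a thresholded sigmoid. The leaky-integrator form, the time step, and all parameter values are illustrative assumptions, not the exact equations or parameters of Konig and Schillen (1991).

```python
# Sketch of a delayed excitatory-inhibitory oscillator pair (cf. Fig. 1A).
import numpy as np

dt, steps = 0.1, 3000
d_ie, d_ei = 20, 20            # delays in time steps: e -> i and i -> e
w_ie, w_ei = 2.0, -2.0         # excitatory and inhibitory coupling weights
theta = 0.5                    # threshold of the sigmoidal output F(x)

def F(x):
    return 1.0 / (1.0 + np.exp(-(x - theta) / 0.1))

x_e, x_i = np.zeros(steps), np.zeros(steps)
i_ext = 0.55                   # external stimulus input to the excitatory unit
for t in range(max(d_ie, d_ei), steps - 1):
    x_e[t + 1] = x_e[t] + dt * (-x_e[t] + w_ei * F(x_i[t - d_ei]) + i_ext)
    x_i[t + 1] = x_i[t] + dt * (-x_i[t] + w_ie * F(x_e[t - d_ie]))
# with sufficient external input the pair leaves its fixed point and oscillates
```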
In our model, we represent this experimental situation (Engel et al. 1991) by a one-dimensional chain of 16 oscillators (Fig. 4A). Each oscillator represents a population of cells with one particular orientation preference. For all the oscillators, we assume a continuous sequence of preferred orientations in steps of 11.25°. The receptive fields of all oscillators are located at the same retinal position. We simulate stimulus light bars by distributions of input activity corresponding to a gaussian tuning of preferred orientations with 30° tuning width (Fig. 4C, left). The orientation of a stimulus bar is represented by centering the input distribution appropriately. In the network's initial configuration, each oscillator is coupled by synchronizing and desynchronizing delay connections with all other oscillators in the chain. Initially, coupling weights w_ie^(r) and w_ee^(r) are homogeneously distributed. Note that, thus, no topology is imposed a priori on the coupling connections. Simulations were performed with the MENS modeling environment (Schillen 1991) on a VAXstation 3100.

3 Learning Rules
We train the network with a pseudorandom sequence of input stimuli corresponding to the model's 16 different preferred orientations. Each stimulus is presented for a period of Δt ≈ 200 msec. Synaptic weight changes occur at the end of each stimulus presentation. Based on recent observations by Artola et al. (1990) in slices from rat visual cortex, we use the following two-threshold learning rule (ABS rule) for weight changes of the synchronizing connections (w_ie^(r)) in our network:
Δw(t) = ε ā_pre(t) f_ABS(ā_post(t))

where t is time, Δw is the synaptic weight change, ε is a rate constant, ā_pre and ā_post are the respective mean activities of the pre- and postsynaptic units, and
f_ABS(ā) = { C0 : ā ≤ θ1;  C1 : θ1 < ā ≤ θ2;  C2 : θ2 < ā },  with C1 < C0 < 0 < C2,
is a two-threshold function of the postsynaptic activity with thresholds θ1 and θ2 (Fig. 2, solid; Artola et al. 1990). We calculate a unit's mean activity ā during a period of stimulus presentation according to

ā(t) = (1/Δt) ∫_t^(t+Δt) a(t') dt' + a_0

where the offset a_0 ensures the positivity of the integral. In physiological terms, ā could correspond to an indicator of integrated neuronal activity (Δt ≈ 200 msec) such as intracellular concentrations of second messengers like Ca²⁺ or IP₃ (Berridge and Irvine 1989).
Figure 2: Functions f_ABS of postsynaptic activity employed with the two-threshold ABS (solid) and anti-ABS (dashed) learning rules. ABS rule: Mean postsynaptic activity ā exceeding threshold θ2 leads to synaptic weight potentiation (LTP in Artola et al. 1990). Threshold θ2 is above the "spiking" threshold θ of the output function F(x) of units in our model. Activity ā intermediate between thresholds θ1 and θ2 leads to a depression of synaptic weights (LTD in Artola et al. 1990). Threshold θ1 is below "spiking" threshold θ. For little postsynaptic activity, ā ≤ θ1, we assume a small negative value for f_ABS. This provides a slow degeneration of synaptic connections that consistently fail to excite their postsynaptic target cells. ABS rule learning is employed for modifications of synchronizing delay connections in our model. Anti-ABS rule: For anti-ABS weight changes we exchange the regions of potentiation and depression of the ABS learning rule: θ2 < ā, depression; θ1 < ā ≤ θ2, potentiation. Anti-ABS learning applies to modifications of desynchronizing delay connections in our network.
Sign changes of synaptic weights are excluded by applying a lower boundary to the modified weights: w(t + Δt) = max[0, w(t) + Δw(t)]. This ensures that a synapse maintains its excitatory specificity during training. Even though synaptic modifications are fairly balanced for each unit, total synaptic weights can slowly diverge. Therefore, we keep a unit's total synaptic weight constant by normalization (von der Malsburg 1973). For the network's desynchronizing connections (w_ee^(r)), we generalize the above modification scheme to a corresponding two-threshold anti-ABS learning rule. This is achieved by modifying function f_ABS as depicted in Figure 2 (dashed, C2 < C0 < 0 < C1). Note that we allow the ABS learning rule of weights w_ie^(r) to also affect each oscillator's intrinsic excitatory connection w_ie^(0). However, we do not modify inhibitory connections w_ei^(0), since there is currently no physiological evidence for plasticity of inhibitory synapses.
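The complete update for one unit's incoming synchronizing weights can be sketched as follows. The thresholds and C values are those listed for the synchronizing connections in the Figure 3 caption; the rate constant and the array layout are illustrative. Setting C0 = C1 yields the one-threshold Hebb-like variant examined in Section 4.

```python
# Sketch of the two-threshold ABS update with clipping and normalization.
import numpy as np

theta1, theta2 = 1.0, 2.4
C0, C1, C2 = -3.0, -10.0, 8.0        # C1 < C0 < 0 < C2
eps = 0.001                           # rate constant (illustrative)

def f_abs(a_post):
    return np.where(a_post <= theta1, C0,
                    np.where(a_post <= theta2, C1, C2))

def abs_update(w, a_pre, a_post, total=3.35):
    dw = eps * a_pre * f_abs(a_post)   # delta-w = eps * a_pre * f_ABS(a_post)
    w = np.maximum(0.0, w + dw)        # no sign changes of synaptic weights
    return w * total / w.sum()         # keep the total synaptic weight fixed
```

For the anti-ABS rule, the potentiation and depression constants are exchanged (C2 < C0 < 0 < C1), as in Figure 2 (dashed).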
4 Learning Synchronizing and Desynchronizing Delay Connections
When the network is presented with training stimuli, the described learning functions lead to rapid changes of synaptic weights. After several hundred stimulus presentations, only those synchronizing connections that couple oscillators differing by not more than 22.5° in their preferred orientations maintain nonzero synaptic weights. For all other synchronizing connections, synaptic weights degenerate to near zero (Fig. 3A). The desynchronizing connections develop a bipartite weight distribution that provides active desynchronization between oscillators that differ by 22.5°, ..., 56.25° in their orientation preferences (Fig. 3B). Since the dynamics of desynchronizing weight changes is more susceptible to imbalances in the sequence of training stimuli, we use a reduced learning rate ε_ee = 0.1 ε_ie. Figure 3C shows the asymptotic weight distributions after 3200 stimulus presentations at the end of the training phase. These distributions closely resemble those implemented in a previous network (Schillen and Konig 1991). Note that for the synchronizing connections the tuning width of synaptic weights (orientation difference between maximum and half-maximum weight; 15°) corresponds to only about half the width of the orientation tuning assumed for each neuronal population (30°). After training is completed we disable further weight changes. Then we test the network with (1) a single stimulus (Fig. 4, left column), and (2) two superimposed stimuli of orthogonal orientation (Fig. 4, right column), as in the experimental situation (Engel et al. 1991; Schillen and Konig 1991). With a single stimulus, the activity of all responding cells is well synchronized. This identifies the neuronal populations that belong to the single oscillatory assembly representing the stimulus (Fig. 4E, left). In the case of the two superimposed stimuli, each stimulus is represented by its respective assembly of synchronously oscillating cells. However, because of the desynchronizing connections a rapidly varying phase relation is found between the two assemblies (Fig. 4D, right). Averaging over 20 epochs results in a reduced cross-correlation function with a random phase relation (Fig. 4E, right). Without this active desynchronization the oscillatory responses to the superimposed stimuli become completely synchronized (data not shown). [See Schillen and Konig (1991) for a detailed discussion of the desynchronization of assemblies defined by temporal structure of neuronal activity.] The observed synchronization and desynchronization of oscillatory assemblies in this model agrees well with the experimental evidence (Engel et al. 1991). In a second investigation, we train a network with pairs of input stimuli corresponding to random combinations of two different stimulus orientations. In order to provide a better resolution of orientation differences, we use a network comprising 32 oscillatory elements. We normalize the superimposed distributions of input activity of the stimulus pairs to the same maximum activity as that of a single stimulus. Training with these randomly paired stimuli leads to an equivalent weight distribution of synchronizing and desynchronizing connections as the training with single stimuli (Fig. 3D, thick lines).
Figure 3: Development of synaptic weights during training. (A) Development of weights w_ie^(r) of synchronizing connections. Depicted are the coupling weights from all 16 excitatory units in the network (Fig. 4A) to inhibitory unit 8. Each presentation of a training stimulus lasts 200 msec. (B) Development of weights w_ee^(r) of desynchronizing connections. The panel shows the weights from the network's other excitatory units to excitatory unit 8. (C) Weight distributions for the coupling connections shown in (A) and (B) after 3200 stimulus presentations. w_ie^(r), solid; w_ee^(r), dashed. Parameters: for modifying synchronizing connections: θ1 = 1.0, θ2 = 2.4, C0 = -3.0, C1 = -10.0, C2 = 8.0; for modifying desynchronizing connections: θ1 = 1.2, θ2 = 2.6, C0 = -3.0, C1 = 8.0, C2 = -10.0; Σ_r w_ie^(r) = 3.35; Σ_r w_ee^(r) = 3.35; w_ei^(0) = -1.0; input strength 0.55; other parameters as in Schillen and Konig (1991). (D) Weight distributions analogous to (C) after training a network of 32 oscillatory elements with superimposed input stimuli of randomly paired orientations. Solid, dashed: ABS learning rule; dotted, dot-dashed: one-threshold Hebb-like learning rule (C0 = C1).
Figure 4: Stimulus-dependent assembly formation. (A) 16 oscillators representing orientation-selective neuronal populations with closely overlapping receptive field locations. (B) Stimulus conditions of one single (left) and two superimposed (right) light bars. (C) Distributions of input activity representing the two stimulus conditions in (B). (D) Activity traces from the two units hatched in (A). (E) Mean normalized auto- (dashed) and cross- (solid) correlations of activity from the units shown in (D). Mean of 20 epochs of 20 T; T, period length of an isolated oscillator. Normalization by the geometric mean of the two auto-correlations. Parameters: input activity as specified in (C), weight distributions w_ie^(r) and w_ee^(r) as shown in Figure 3C, other parameters as in Schillen and Konig (1991).
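The correlation measure described in the caption can be sketched directly: the cross-correlation of two activity traces, normalized by the geometric mean of their zero-lag auto-correlations. The function name and the mean-subtraction detail are illustrative assumptions.

```python
# Normalized cross-correlation of two activity traces (cf. Fig. 4E).
import numpy as np

def norm_xcorr(a, b, max_lag):
    a = a - a.mean()
    b = b - b.mean()
    cc = []
    for lag in range(-max_lag, max_lag + 1):
        x = a[max(0, -lag):len(a) - max(0, lag)]
        y = b[max(0, lag):len(b) - max(0, -lag)]
        cc.append(np.sum(x * y))
    norm = np.sqrt(np.sum(a * a) * np.sum(b * b))  # geometric mean of autos
    return np.array(cc) / norm
```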
As expected, testing the network after training with (1) the single stimulus and (2) the two superimposed orthogonal stimuli shows again the synchronization and desynchronization of oscillatory assemblies in agreement with the experimental observations (Engel et al. 1991). Next, we use the previous simulation to assess the dependence of the obtained connectivity pattern on the presence of the two thresholds of the ABS learning rule. For this purpose, we establish a one-threshold learning rule by choosing C0 = C1, which renders threshold θ1 ineffective. The resulting modification scheme then corresponds to a one-threshold Hebb-like learning rule operating on mean activities ā in an oscillatory network. Training the network again with randomly paired stimuli leads to a broader tuning of the resulting weight distributions as compared to training with the two-threshold rule (Fig. 3D, thin lines): Synchronizing connections are maintained to all units in the network, and corresponding coupling weights to neighboring units are reduced. Similarly, the distribution of desynchronizing connections is spread over most units, leading to strongly reduced weight contributions for each individual unit. Testing the trained network with the two superimposed orthogonal stimuli shows that with this connectivity pattern the network is no longer able to exhibit the stimulus-dependent segregation of oscillatory assemblies. With the one-threshold learning rule, the increased coupling length of synchronizing connections and the flat weight distribution of desynchronizing connections result in bulk synchronization of all activated oscillators.

5 Conclusions
In this paper, we have described the self-organization of synchronizing and desynchronizing coupling connections in a network of neuronal oscillators. The results presented demonstrate that local learning rules can establish a network that exhibits stimulus-dependent formation and segregation of oscillatory assemblies in agreement with experimental evidence (Engel et al. 1991). With respect to known physiological data it is particularly interesting that these learning rules do not need to incorporate any assumptions as to the oscillatory character of the network's stimulus responses. During training, the employed learning rules make use of a measure of mean neuronal activity, integrated on a long time scale, to develop a network that is able to operate in its functional response on rather short time scales. This indicates that during ontogenesis physiologically related local modification schemes can suffice to develop a connectivity pattern in support of the observed temporal structure of stimulus responses in cat visual cortex. The simulations presented above show that the two-threshold synaptic modification rule found in in vitro preparations of rat visual cortex
(Artola et al. 1990; Fig. 2, solid) is well suited to account for the development of synchronizing connections in our network. Synaptic weights are increased for those connections for which the activity ā of the postsynaptic unit attains the learning rule's potentiation domain by exceeding the upper threshold θ2 (θ2 < ā). If the postsynaptic activity falls into the depression domain (θ1 < ā ≤ θ2), the weights of the pertaining connections are decreased. With only little postsynaptic activity (ā ≤ θ1), synaptic weights are subject to a slow decay, leading to an elimination of those connections that consistently fail to activate their postsynaptic target cells. This modification scheme stabilizes synchronizing connections between units whose orientation preferences differ by not more than 22.5° (Fig. 3A). The resulting distribution of synchronizing connections is the basis for the representation of a stimulus by an assembly of synchronously oscillating cells. Since the tuning width of the weight distribution w_ie^(r) corresponds to only half the orientation tuning assumed for each neuronal population, the obtained connectivity pattern avoids completely synchronizing the entire network without compromising coarse coding of stimulus orientation. Generalizing the described ABS learning rule to anti-ABS weight changes for desynchronizing connections (Fig. 2, dashed) leads to a bipartite weight distribution w_ee^(r) (Fig. 3B). In this case, desynchronizing connections between units with similar orientation preferences are depressed, while those between units differing by 22.5°, ..., 56.25° in their preferred orientations are potentiated. The resulting distribution of desynchronizing connections is the basis for the uncorrelated activity between two assemblies representing two superimposed stimulus bars of orthogonal orientation (Schillen and Konig 1991). Training our network with a one-threshold Hebb-like learning rule results in an increased coupling length of synchronizing connections and reduced weight contributions of desynchronizing connections (Fig. 3D). With such a connectivity pattern the network exhibits bulk synchronization of all activated oscillators. Thus, the self-organization with the one-threshold Hebb-like learning rule cannot establish the stimulus-dependent segregation of oscillatory assemblies in our network. The physiological evidence for the ABS learning rule has been established for excitatory synapses on excitatory cells (Artola et al. 1990). At present, the evidence for plasticity of excitatory synapses on inhibitory cells is only indirect (Buzsáki and Eidelberg 1982; Kairiss et al. 1987; Taube and Schwartzkroin 1987). Nevertheless, we consider our application of the ABS modification rule to the network's synchronizing connections to be justified, since excitatory synapses on inhibitory cells have many features in common with those on excitatory cells (Sah et al. 1990). Currently, there is no physiological evidence for the anti-ABS learning rule which we apply to the desynchronizing connections in our network. However, we postulate anti-Hebbian mechanisms for several reasons. Reductions of synaptic weight in response to associated pre- and postsynaptic neuronal
activity are present within the depression domain of the ABS modification scheme (Artola et al. 1990). Similarly, a decrease of synaptic weights associated with postsynaptic activity has been found in several paradigms of long-term depression (Ito 1989; Stanton and Sejnowski 1990). Modulatory systems could conceivably gate anti-ABS synaptic modifications by adjusting membrane potentials or thresholds determining synaptic depression and potentiation. In our simulations, we use anti-ABS weight changes as a canonical extension of the ABS learning rule, appropriate to establish connections between units with uncorrelated activity. Similarly, other authors have introduced anti-Hebbian learning for the purpose of decorrelating activity in their networks (Barlow 1989; Rubner and Schulten 1990; Foldiák 1990; Hopfield et al. 1983). We did not include synaptic modifications of the network's inhibitory connections in our investigation, since no plasticity of inhibitory synapses has yet been found. A learning rule with a depression and a potentiation domain of synaptic weights has been introduced by Bienenstock et al. (1982) for the modeling of deprivation experiments. The single threshold that separates the depression and potentiation domains is adjusted as a nonlinear function of mean postsynaptic activity. This dynamic threshold adjustment is used to maintain, on average, constant total synaptic weights of units in the network. In an oscillatory network, a corresponding dynamic threshold adaptation would have to generalize the algorithm of Bienenstock et al. (1982) to a functional dependence on global correlations in the network. However, such an approach would give up the principle of locality proposed for the learning rules in the current investigation. Therefore we ensure the convergence of total synaptic weights by normalization (von der Malsburg 1973). The applied learning rule involves the time averaging of pre- and postsynaptic activity. This is done because neuronal plasticity involves mediators with long time constants such as, for example, Ca²⁺ and IP₃ (Berridge and Irvine 1989). However, the exact sequence of synchrony detection, corresponding to the multiplication of pre- and postsynaptic activity in our learning rule, and time averaging, corresponding to the integral over a period of 200 msec, needs further experimental investigation: Stanton and Sejnowski (1990) provide evidence for a fast detection of synchrony in in vitro recordings of hippocampal slices. On the other hand, a temporal contiguity of visual stimuli presented to both eyes of 200 msec has been found to be sufficient for the maintenance of binocularity in cat visual cortex (Altmann et al. 1987), a process subject to plastic changes in paradigms such as alternating monocular deprivation. These two results demonstrate that the exact sequence of synchrony detection and temporal averaging for the paradigm under consideration has yet to be established. Furthermore, in the developing cortex incomplete myelinization leads to long conduction delays. Thus it may not be possible to synchronize oscillatory signals on a fine temporal scale, although connections appropriate for this task have to be developed. That the self-organization of such
a network, suited for stimulus-dependent assembly formation on a fine temporal scale, is possible by using only response properties on a temporal scale of a few hundred milliseconds is demonstrated by our results. Hartmann and Drüe (1990) presented a network model that represents continuous line segments by neuronal assemblies defined by correlated activity. They also report on an approach for the self-organization of synchronizing connections by a Hebb-like learning rule. Since Hartmann and Drüe do not aim at simulating the experimentally observed segregation of superimposed stimuli (Engel et al. 1991; Schillen and Konig 1991), their model is not designed to cope with this problem. As a consequence, Hartmann and Drüe do not address the self-organization of a desynchronizing mechanism appropriate for this task. The self-organization of neural networks with oscillatory activity has also been investigated in models of the olfactory bulb (Freeman et al. 1988; Taylor and Keverne 1991). In a conditioning paradigm of odor discrimination, Taylor and Keverne (1991) use a Hebb-like learning rule to associate a frequency shift of oscillatory responses with a conditioned pattern. This frequency shift contrasts with the behavior of the model presented in this paper. Our system of delay-coupled units with nonlinear output functions exhibits only a weak dependence of the frequency of oscillatory responses on the coupling weights (Konig and Schillen 1991). Thus in our network, synaptic weight changes induced by learning do not interfere with the correlation of activity between units. An investigation of the ABS learning rule (Artola et al. 1990) with respect to associative memory has been presented by Hancock et al. (1991). Hancock et al. discuss the error-correcting properties of the two-threshold learning rule in comparison to error backpropagation and conventional Hebbian learning algorithms. Wang et al. (1990) addressed the discrimination of superimposed patterns by temporal structure of activity in an associative memory of neuronal oscillators. In their network, synchronizing and desynchronizing coupling connections are implemented by excitatory and inhibitory connections, originating concurrently from the same unit. A Hebb-like learning rule is applied to both types of connections, which are allowed in addition to change their excitatory and inhibitory specificities. To circumvent these physiologically implausible assumptions requires the introduction of additional interneurons, which would, however, affect the temporal characteristics of the network. These and other considerations (Konig and Schillen 1991; Schillen and Konig 1991) have determined our choice of synchronizing and desynchronizing coupling connections, and the introduction of anti-ABS weight changes for the desynchronizing connections. Kleinfeld and Sompolinsky (1989) have investigated the temporal structure of neuronal activity in a modified Hopfield network. In their application, two types of coupling connections with different time scales are used to learn stereotyped sequences of activity patterns, as observed in biological central pattern generators.
Connections operating on a short time scale serve to store the activity patterns of the memory. Connections with a long time scale represent the transitions between the consecutive patterns of a sequence. In this way, the network can learn temporal sequences by a local Hebb-like learning rule. Another application of the temporal structure of neuronal activity to learning and recall in an associative memory has been described by Abbott (1990). In this application, switching the network between stationary and oscillatory modes of activity, on different time scales, is used for initiating and terminating learning in the memory. Similarly, the effect of different time scales on learning in neural networks has been studied by Baldi and Pineda (1991). These authors discuss switching between supervised and unsupervised learning in oscillatory and nonoscillatory networks. In our current investigation we used simulated light bars as stimuli for the self-organization of synchronizing and desynchronizing coupling connections between neuronal oscillators. Integrating stimulus responses on a long time scale during learning, the network developed the appropriate functional behavior for responses on short time scales. If trained with single stimulus bars (Fig. 3C), ABS rule learning established synchronized activity within the oscillatory assembly that represented the stimulus. At the same time, anti-ABS weight changes developed desynchronizing connections that allowed the segregation of superimposed but distinct additional stimuli. Furthermore, if trained with two superimposed stimulus bars of randomly paired orientations (Fig. 3D), the network still detected the single bar as the underlying coherent stimulus, generalizing over the random combinations of the two presented stimulus bars. In contrast, training the network with a one-threshold Hebb-like learning rule could not establish the stimulus-dependent segregation of oscillatory assemblies. In general, it is not clear a priori which neuronal responses in a natural environment would have to be synchronized and desynchronized in order to constitute neuronal assemblies appropriate for the representation of the visual field. Here, the self-organization of the neuronal network during ontogenesis provides a means to develop adequate coupling connections in response to the interaction with the environment. The coherent presentation of stimulus features, integrated on a long time scale during learning, can then lead to the development of synchronizing and desynchronizing connections, which are appropriate for the formation and segregation of neuronal assemblies on a short time scale in the mature organism. The work presented in this paper indicates that physiologically related local learning rules suffice to establish neuronal networks that, after training, exhibit a stimulus-dependent assembly formation of oscillatory responses in agreement with the observations in cat visual cortex.
Acknowledgments It is our pleasure to thank Wolf Singer and Alain Artola for valuable discussions and comments on the first draft of this paper. Renate Ruhl provided excellent graphic assistance. Many thanks also to Neil Steinberg for improving the English. This work has been supported in part by the Deutsche Forschungsgemeinschaft (SFB 185).
References

Abbott, L. F. 1990. Modulation of function and gated learning in a network memory. Proc. Natl. Acad. Sci. U.S.A. 87, 9241-9245.
Altmann, L., Luhmann, H. J., Greul, J. M., and Singer, W. 1987. Functional and neuronal binocularity in kittens raised with rapidly alternating monocular occlusion. J. Neurophysiol. 58, 965-980.
Artola, A., Brocher, S., and Singer, W. 1990. Different voltage-dependent thresholds for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature (London) 347, 69-72.
Baldi, P., and Pineda, F. 1991. Contrastive learning and neural oscillations. Neural Comp. 3, 526-545.
Barlow, H. 1989. Unsupervised learning. Neural Comp. 1, 295-311.
Berridge, M. J., and Irvine, R. F. 1989. Inositol phosphates and cell signaling. Nature (London) 341, 197-205.
Bienenstock, E. L., Cooper, L. N., and Munro, P. W. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.
Buzsáki, G., and Eidelberg, E. 1982. Direct afferent excitation and long-term potentiation of hippocampal interneurons. J. Neurophysiol. 48, 597-607.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Engel, A. K., Konig, P., and Singer, W. 1991. Direct physiological evidence for scene segmentation by temporal coding. Proc. Natl. Acad. Sci. U.S.A. 88, 9136-9140.
Foldiák, P. 1990. Forming sparse representations by local anti-Hebbian learning. Biol. Cybern. 64, 165-170.
Freeman, W. J. 1975. Mass Action in the Nervous System. Academic Press, New York.
Gray, C. M., and Singer, W. 1987. Stimulus-specific neuronal oscillations in the cat visual cortex: A cortical functional unit. Soc. Neurosci. Abstr. 13 (404.3).
Gray, C. M., and Singer, W. 1989. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698-1702.
Gray, C. M., Konig, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Freeman, W. J., Yao, Y., and Burke, B. 1988. Central pattern generating and recognizing in olfactory bulb: A correlation learning rule. Neural Networks 1, 277-288.
Hancock, P. J. B., Smith, L. S., and Phillips, W. A. 1991. A biologically supported error-correcting learning rule. Neural Comp. 3, 201-212.
Hartmann, G., and Drüe, S. 1990. Self organization of a network linking features by synchronization. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, G. Hartmann, and G. Hauske, eds., pp. 361-364. Elsevier, Amsterdam.
Hopfield, J. J., Feinstein, D. I., and Palmer, R. G. 1983. "Unlearning" has a stabilizing effect in collective memories. Nature (London) 304, 158-159.
Ito, M. 1989. Long-term depression. Annu. Rev. Neurosci. 12, 85-102.
Kairiss, E. W., Abraham, W. C., Bilkey, D. K., and Goddard, G. V. 1987. Field potential evidence for long-term potentiation of feed-forward inhibition in the rat dentate gyrus. Brain Res. 401, 87-94.
Kleinfeld, D., and Sompolinsky, H. 1989. An associative network model for central pattern generators. In Methods in Neuronal Modeling: From Synapses to Networks, C. Koch and I. Segev, eds. MIT Press, Cambridge, MA.
Konig, P., and Schillen, T. B. 1990. Segregation of oscillatory responses by conflicting stimuli: Desynchronizing connections in neural oscillator layers. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, G. Hartmann, and G. Hauske, eds., pp. 117-120. Elsevier, Amsterdam.
Konig, P., and Schillen, T. B. 1991. Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization. Neural Comp. 3, 155-166.
Rubner, J., and Schulten, K. 1990. Development of feature detectors by self-organization. Biol. Cybern. 62, 193-199.
Sah, P., Hestrin, S., and Nicoll, R. A. 1990. Properties of excitatory postsynaptic currents recorded in vitro from rat hippocampal interneurons. J. Physiol. 430, 605-616.
Schillen, T. B. 1991. Designing a neural network simulator: The MENS modelling environment for network systems II. Comp. Appl. Biosci. 7, 431-446.
Schillen, T. B., and Konig, P. 1990. Coherency detection by coupled oscillatory responses: Synchronizing connections in neural oscillator layers. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, G. Hartmann, and G. Hauske, eds., pp. 139-142. Elsevier, Amsterdam.
Schillen, T. B., and Konig, P. 1991. Stimulus-dependent assembly formation of oscillatory responses: II. Desynchronization. Neural Comp. 3, 167-177.
Singer, W. 1990. Search for coherence: A basic principle of cortical self-organization. Concepts Neurosci. 1, 1-26.
Stanton, P. K., and Sejnowski, T. J. 1990. Associative long-term depression in the hippocampus induced by Hebbian covariance. Nature (London) 339, 215-218.
Taube, J. S., and Schwartzkroin, P. A. 1987. Intracellular recording from hippocampal CA1 interneurons before and after development of long-term potentiation. Brain Res. 419, 32-38.
Taylor, J. G., and Keverne, E. B. 1991. Accessory olfactory learning. Biol. Cybern. 64, 301-305.
von der Malsburg, C. 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.
von der Malsburg, C. 1981. The correlation theory of brain function. Internal Report 81-2, Max-Planck-Institute for Biophysical Chemistry, Gottingen, Germany.
von der Malsburg, C., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybern. 54, 29-40.
Wang, D., Buhmann, J., and von der Malsburg, C. 1990. Pattern segmentation in associative memory. Neural Comp. 2, 94-106.
Received 5 August 1991; accepted 16 March 1992.
Communicated by Vincent Torre
Seeing Beyond the Nyquist Limit

Daniel L. Ruderman*
William Bialek*
Department of Physics and Department of Molecular and Cell Biology, University of California at Berkeley, Berkeley, CA 94720 USA
In many biological systems the primary transduction of sensory stimuli occurs in a regular array of receptors. Because of this discrete sampling it is usually assumed that the organism has no knowledge of signals beyond the Nyquist frequency. In fact, higher frequency signals are expected to mask the available lower frequency information as a result of aliasing. It has been suggested that these considerations are important in understanding, for example, the design of the receptor lattice in the mammalian fovea. We show that if the organism has knowledge of the probability distribution from which the signals are drawn, outputs from a discrete receptor array can be used to estimate signals beyond the Nyquist limit. In effect, a priori knowledge can be used to de-alias the image, and the estimated signal above the Nyquist cutoff is in fact coherent with the real signal at these high frequencies. We address initially the problem of stimulus reconstruction from a noisy receptor array responding to a Gaussian stimulus ensemble. In this case, the best reconstruction strategy is a simple linear transformation. In the more interesting (and natural) case of nongaussian stimuli, optimal reconstruction requires nonlinear operations, but the higher order correlations in the stimulus ensemble can be used to improve the estimate of super-Nyquist signals.
1 Introduction
Sensory stimuli reaching an organism are initially encoded in the activity of an array of sensory neurons. From this discrete pattern of activity and a priori information about the stimulus ensemble, the organism must be able to reconstruct important aspects of the stimulus. Since the animal's performance will be limited by the properties of the receptor array, it is important that its design take full advantage of any knowledge of the stimulus ensemble. *Present address: NEC Research Institute, 4 Independence Way, Princeton, NJ 08540. Neural Computation 4, 682-690 (1992)
© 1992 Massachusetts Institute of Technology
Each element in the receptor array encodes information about a particular subset of the stimulus. In the cochlea, for example, each hair cell responds to a limited frequency range. In the retina each photoreceptor has a limited aperture. Presumably different arrangements of these apertures result in receptor signals that convey more or less information about the outside world. In particular, the combination of receptor sampling, receptor noise, and the statistics of natural signals can lead to nontrivial optimization problems for the design of the receptor array. One example is the compound eye, where some insects sacrifice angular resolution in favor of collecting more photons per receptor and hence providing better intensity resolution; these tradeoffs can be understood semiquantitatively in terms of information theoretic optimization principles (Snyder et al. 1977). In the compound eye one can easily demonstrate that the discreteness of the photoreceptor array leads to aliasing: Behavioral and neural reactions to moving gratings are reversed if the spatial frequency of the grating exceeds the Nyquist limit. But gratings are highly unnatural stimuli. In this work we ask how the constraints of discrete sampling limit performance under conditions where the stimuli are chosen at random from some "natural" distribution that is presumed known to the organism. We find that it is in fact possible to reconstruct meaningful information beyond the Nyquist limit, and comment on the implications of this result for retinal design.

2 The Receptor Array
We will consider an infinite array of receptors lying on a one-dimensional lattice at the points x_n = na, where a is the receptor spacing and n is any integer. Each element has the same receptive field profile or aperture function f(x), centered at the location of that element. The output of the nth element is

y_n = ∫_{-∞}^{∞} dx f(x - x_n) φ(x) + ν_n   (2.1)

where φ(x) is the stimulus and ν_n is the receptor noise (see Fig. 1). We shall assume the noise is additive, gaussian, and independent in each channel, ⟨ν_m ν_n⟩ = σ² δ_mn. Such an array will act as a simple approximation to a system of sensory cells, which in general have nonlinearities as well as nontrivial temporal responses.
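The following short numerical sketch makes the model concrete. It is our illustration, not part of the original paper; the gaussian aperture, the grid, and all parameter values are assumptions chosen for the example.

import numpy as np

# Sketch of the receptor array of equation 2.1 (illustrative parameters).
rng = np.random.default_rng(0)

a = 1.0                                  # receptor spacing
sigma = 0.1                              # receptor noise standard deviation
width = 0.5                              # width of the aperture f(x) (assumed gaussian)

x = np.linspace(-20.0, 20.0, 4001)       # fine grid carrying the stimulus phi(x)
dx = x[1] - x[0]
phi = rng.normal(size=x.size)            # toy stimulus: white noise on the grid

x_n = a * np.arange(-15, 16)             # receptor positions x_n = n a

def aperture(u):
    # receptive-field profile f(x): a normalized gaussian
    return np.exp(-u**2 / (2 * width**2)) / (np.sqrt(2 * np.pi) * width)

# y_n = integral dx f(x - x_n) phi(x) + nu_n   (equation 2.1, discretized)
F = aperture(x[None, :] - x_n[:, None])  # row n holds f(x - x_n) on the grid
y = F @ phi * dx + sigma * rng.normal(size=x_n.size)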
3 Stimulus Reconstruction

The organism is faced with the problem of estimating various features of the stimulus φ(x) using the receptor cell outputs {y_n}. Everything that we know about φ(x) by virtue of observing the {y_n} is summarized by
Figure 1: A stimulus periodically sampled with noisy filters.

the posterior probability distribution P[φ | {y_n}]. Using Bayes' theorem we have

P[φ | {y_n}] = P[{y_n} | φ] P[φ] / P[{y_n}]   (3.1)
Given that we have seen the activities {y_n}, it is this distribution that tells us the best φ to choose as the original stimulus. Note that P[{y_n}] acts only as a normalization factor since it is assumed that the activities {y_n} are known. The important aspects of this distribution are characterized by P[{y_n} | φ] and P[φ]. Since the noise is gaussian and independent in each channel we have

P[{y_n} | φ] ∝ exp[-(1/2σ²) Σ_n (y_n - F_n[φ])²]   (3.2)
where

F_n[φ] = ∫ dx f(x - x_n) φ(x) = ∫ (dk/2π) e^{ikx_n} f(k) φ*(k)   (3.3)
is the filter activity uncorrupted by noise. We define φ(k), the Fourier transform of φ(x), as

φ(k) = ∫ dx e^{-ikx} φ(x)   (3.4)
The question of choosing the distribution of the stimuli, P[φ], is more complicated. As a simple first step we shall assume the stimuli are drawn
from a gaussian ensemble characterized by a (two-sided) power spectrum S(k), so

P[φ] ∝ exp[-(1/2) ∫ (dk/2π) |φ(k)|²/S(k)]   (3.5)

The treatment of nongaussian stimuli will be addressed below. Gaussian stimuli and noise lead to a gaussian distribution for P[φ | {y_n}]:

P[φ | {y_n}] ∝ exp{-(1/2) ∫ (dk/2π) ∫ (dk′/2π) [φ(k) - φ_e(k)]* C⁻¹(k, k′) [φ(k′) - φ_e(k′)]}   (3.6)
The mean of the distribution is φ_e(k) and its width is determined by C(k, k′). From the distribution P[φ | {y_n}] we must choose some particular φ_e(k) that is the "best" estimate of the actual signal. The most useful estimator depends on the costs for making different kinds of errors, but there are two natural choices: maximum likelihood, in which we find the φ_e(k) that satisfies
δP[φ | {y_n}]/δφ(k) |_{φ=φ_e} = 0   (3.7)

and the conditional average,
φ_e(k) = ∫ Dφ P[φ | {y_n}] φ(k)   (3.8)
Here ∫ Dφ means a functional integral over all signals φ(x). In the simple gaussian case considered here both approaches lead to the same result,

φ_e(k) = S(k) f*(k) y(k) / [σ² + (k_0/2π) Σ_n |f(k + nk_0)|² S(k + nk_0)]   (3.9)

which is just the mean of the distribution P[φ | {y_n}]. Here k_0 = 2π/a and y(k) is the Fourier transform of the receptor outputs; since the y_n lie on a lattice we have y(k) = y(k + Mk_0) for all integer M. Thus the most likely stimulus can be derived from the receptor activities through a linear transformation. In principle there is no limit to the spatial frequency of the reconstruction. The accuracy is limited by noise, filter strength, and aliasing among frequencies separated by integer multiples of the lattice wavevector. A formula analogous to equation 3.9, derived as the linear filter that minimizes reconstruction error, has been known since the 1950s (Stewart 1956). With any signal estimate it is important to know its degree of confidence. In the current example this is given by the width of the distri-
bution of φ_e(k), which is characterized by the correlation function of the distribution:

C(k, k′) = 2π S(k) δ(k - k′) - [k_0 f(k) S(k) f*(k′) S(k′) Σ_n δ(k - k′ - nk_0)] / [σ² + (k_0/2π) Σ_n |f(k + nk_0)|² S(k + nk_0)]   (3.10)

This correlation function is not diagonal when there is aliasing; the Fourier components become coupled and thus covary. When the noise variance is small and aliasing is absent the formula reduces to

C(k, k′) ≈ 2π δ(k - k′) 2πσ²/[k_0 |f(k)|²]   (3.11)

The variance goes to zero as the noise diminishes, thus increasing the reliability of the estimate. This correlation function plays an important role in the estimation of nongaussian stimuli.

4 Examples
Consider the simplest case of sampling. The signal is band-limited to below the Nyquist sampling frequency k_0/2, and the sampling is done by noiseless delta-function filters [σ = 0, f(k) = 1]. Then the reconstruction is exact and takes the familiar Shannon-Whittaker form (Shannon 1948)

φ_e(x) = Σ_n y_n sin[k_0(x - x_n)/2] / [k_0(x - x_n)/2]   (4.1)

A de-aliasing reconstruction example is shown in Figure 2. The signal (solid line) is drawn from a white noise distribution with a maximum frequency of 6 cycles. Such a signal requires at least 13 (= 2 × 6 + 1) sampling points to reconstruct exactly. In this case it is sampled at 11 points (diamonds) by noiseless delta-function filters and reconstructed based on these samples. The first reconstruction (dotted line) is done using the Shannon-Nyquist formula and displays a classic case of aliasing. The second reconstruction (dashed line) uses the optimal filter derived above, which takes into account the actual power spectrum. This a priori knowledge allows the filter to de-alias the signal, producing a more accurate reconstruction.
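This construction is easy to reproduce numerically. The sketch below is ours, not the authors' code; it assumes noiseless delta-function sampling and equal prior power on each harmonic, in which case the optimal filter of equation 3.9 reduces to splitting each aliased Fourier bin evenly among the frequencies that fold onto it (compare equation 5.2 below).

import numpy as np

rng = np.random.default_rng(1)
K, N = 6, 11          # highest harmonic; number of samples (2K+1 = 13 > N)

# Random real signal with flat spectrum on harmonics 0..K of the unit interval.
coef = np.zeros(2 * K + 1, dtype=complex)        # index k+K holds frequency k
for k in range(K + 1):
    c = rng.normal() + 1j * rng.normal() if k else rng.normal() + 0j
    coef[K + k], coef[K - k] = c, np.conj(c)

def signal(xs, cf):
    ks = np.arange(-K, K + 1)
    return np.real(cf @ np.exp(2j * np.pi * np.outer(ks, xs)))

xs_sample = np.arange(N) / N
y = signal(xs_sample, coef)                      # noiseless delta-function sampling

bins = np.fft.fft(y) / N                         # DFT: bin m sums all k = m (mod N)

naive = np.zeros_like(coef)                      # Shannon: keep only |k| <= (N-1)/2
dealiased = np.zeros_like(coef)                  # equation 3.9 with equal priors
for m in range(N):
    # frequencies in the prior support {-K..K} that alias onto bin m
    aliases = [k for k in range(-K, K + 1) if (k - m) % N == 0]
    naive[K + min(aliases, key=abs)] = bins[m]   # sub-Nyquist interpretation only
    for k in aliases:                            # equal priors: split the bin evenly
        dealiased[K + k] = bins[m] / len(aliases)

grid = np.linspace(0, 1, 400, endpoint=False)
true = signal(grid, coef)
for name, cf in [("aliased", naive), ("de-aliased", dealiased)]:
    err = np.sqrt(np.mean((signal(grid, cf) - true) ** 2))
    print(f"{name:10s} rms error: {err:.3f}")

On average the de-aliased estimate halves the error contributed by the aliased frequency pair, which is the numerical counterpart of Figure 2.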
5 What's going on?
To understand why we can estimate signals above the Nyquist frequency it is helpful to look at a simple example. Consider a signal whose power spectrum includes the frequencies k and k + k_0, but none higher. The reconstruction gives, in the limit of large signal-to-noise ratio,

φ_e(k) = S(k) f*(k) y(k) / [(k_0/2π)(|f(k)|² S(k) + |f(k + k_0)|² S(k + k_0))]   (5.1)
Figure 2: Stimulus reconstruction showing aliasing and de-aliasing. Solid line, signal; diamonds, sample points; dotted line, reconstruction with aliasing; dashed line, reconstruction using a priori statistics. See text.

Thus the ratio of the reconstructed stimuli at the two frequencies is
φ_e(k) / φ_e(k + k_0) = S(k) f(k) / [S(k + k_0) f(k + k_0)]   (5.2)
The signal estimate comes from partitioning the aliased signals according to their relative power spectra and filter strengths. In this way, by knowing the power spectrum, the signal can be estimated at frequencies beyond the Nyquist limit.

6 Is Phase Preserved in Super-Nyquist Frequencies?
In order for a reconstruction to be useful there must be phase coherence between the signal and the estimate. The mean squared error in estimating the signal amplitude at a frequency k is

⟨|φ_e(k) - φ(k)|²⟩ = ⟨|φ_e(k)|²⟩ + ⟨|φ(k)|²⟩ - 2 Re⟨φ_e*(k) φ(k)⟩   (6.1)
The last term involves the cosine of the phase difference between the signal and its estimate. Clearly the larger the phase coherence, the better the estimate will be on average, since the error will be minimized. To quantify the preservation of phase we examine the coherence between the estimate and the signal,

⟨φ*(k) φ_e(k)⟩   (6.2)

Here ⟨φ*(k) φ_e(k)⟩ is an average over all signals and noise. The dependence of φ_e(k) on the signal is through y(k). We find that

⟨φ*(k) φ_e(k)⟩ ∝ |f(k)|² S(k)² / [σ² + (k_0/2π) Σ_n |f(k + nk_0)|² S(k + nk_0)]   (6.3)
Thus there is positive covariance at a frequency as long as there is signal power at that frequency and the filter passes it. The covariance is reduced by aliasing and by the finite level of receptor cell noise σ, but both of these are continuous, gradual effects. Just as the reconstruction of the stimulus does not fail abruptly when the signal-to-noise ratio falls below unity, similarly the reconstruction does not fail abruptly as soon as aliasing is possible. It is interesting to consider what happens in the case of nongaussian stimuli. Here standard methods of statistical physics can be used to calculate conditional averages as a perturbation series in the higher order correlations, which will involve the correlation function of the posterior distribution. The result is that our best estimate of a nongaussian stimulus involves nonlinear processing of the receptor outputs even if the input/output relations of the receptors themselves are linear, as assumed here. Nongaussian behavior, however, means that different Fourier components of the stimulus are correlated with one another, unlike the case of the (stationary) gaussian distribution. In natural images, whose statistics are scale-invariant (Field 1987), this means that high- and low-frequency signals are correlated, and of course this helps in the reconstruction of super-Nyquist data.
7 Conclusions

An organism is faced with the problem of estimating stimuli from noisy neural responses. We have shown that a priori knowledge of the stimulus ensemble leads to meaningful stimulus estimation beyond the naive Nyquist limit. Such applications of prior statistical knowledge are widely used in regularizing ill-posed inverse problems (Turchin et al. 1971). Although we have worked out a one-dimensional example, the theory is readily extended to higher dimensions. An interesting follow-up to this problem is to consider the choice of optimal sampling filter or "receptive field," f(x). A given power spectrum
and system design constraints lead to a variational problem that can be solved for the optimal filter. Here "optimal" may be defined in terms of minimizing the mean-squared error in reconstruction, for example, or accurately detecting a particular stimulus feature. Is any of this relevant to biology? Several years ago a number of authors were concerned with the consequences of aliasing for mammalian vision (Yellot 1982, 1983, 1984; Miller and Bernard 1983; Hirsch and Hylton 1984; Bossomaier et al. 1985). Questions were raised about the role of disorder in the receptor lattice, which might serve to reduce aliasing, as opposed to the more conventional view that our remarkable acuity [and hyperacuity (Westheimer 1981)] is based on a highly ordered receptor array. From equation 6.3 we suggest that this debate missed a crucial point. Under natural conditions the effect of aliasing is essentially the same as that of noise: it causes a gradual reduction in the quality of our estimates of the true image at high spatial frequencies. If the noise is very large, in fact, aliasing is really not a problem at all: From equation 6.3 we see that at high noise levels the gradual decline in correlations at higher frequencies is governed solely by the signal-to-noise ratio, and the confusion triggered by aliasing has no effect. Recent experiments on the photocurrents and noise produced by cones from the primate fovea (Schnapf et al. 1990) can be analyzed to show that the noise level is in fact quite high: The equivalent contrast noise per pixel is of order 30% if one assumes an integration time appropriate to preattentive vision (W. Bialek, unpublished). We suggest that, as a result of the high noise level, antialiasing is not a major design constraint in the mammalian retina. It would be interesting to design experiments that would test the ability of human or animal observers to "de-alias" natural signals in the sense described here.
Acknowledgments

We thank J. S. Joseph, J. P. Miller, and F. M. Rieke for helpful discussions. Work in Berkeley was supported in part by a Presidential Young Investigator award from the National Science Foundation (to W. B.), supplemented by funds from Sun Microsystems, Cray Research, and the NEC Research Institute, and by a graduate fellowship from the Fannie and John Hertz Foundation (to D. L. R.).

References

Bossomaier, T. R. J., Snyder, A. W., and Hughes, A. 1985. Irregularity and aliasing: Solution? Vision Res. 25, 145-147.
Field, D. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4, 2379.
Hirsch, J., and Hylton, R. 1984. Quality of the primate photoreceptor lattice and limits of spatial vision. Vision Res. 24, 347-355.
Miller, W. H., and Bernard, G. D. 1983. Averaging over the foveal receptor aperture curtails aliasing. Vision Res. 23, 1365-1369.
Schnapf, J. L., Nunn, B. J., Meister, M., and Baylor, D. A. 1990. Visual transduction in cones of the monkey Macaca fascicularis. J. Physiol. 427, 681-713.
Shannon, C. E. 1948. A mathematical theory of communication. Bell Sys. Tech. J. 27, 379.
Snyder, A. W., Stavenga, D. G., and Laughlin, S. B. 1977. Spatial information capacity of compound eyes. J. Comp. Physiol. 116, 183-207.
Stewart, R. M. 1956. Statistical design and evaluation of filters for the restoration of sampled data. Proc. IRE 44, 253-257.
Turchin, V. F., Kozlov, V. P., and Malkevich, M. S. 1971. The use of mathematical-statistics methods in the solution of incorrectly posed problems. Soviet Phys. Uspekhi 13, 681-703.
Westheimer, G. 1981. Visual hyperacuity. Prog. Sens. Physiol. 1, 1-30.
Yellot, J. I., Jr. 1982. Spectral analysis of spatial sampling by photoreceptors: Topological disorder prevents aliasing. Vision Res. 22, 1205-1210.
Yellot, J. I., Jr. 1983. Spectral consequences of photoreceptor sampling in the rhesus retina. Science 221, 382-385.
Yellot, J. I., Jr. 1984. Image sampling properties of photoreceptors: A reply to Miller and Bernard. Vision Res. 24, 281-282.
Received 26 November 1991; accepted 31 January 1992.
Communicated by Joseph Atick
Local Synaptic Learning Rules Suffice to Maximize Mutual Information in a Linear Network

Ralph Linsker
IBM Research Division, T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598 USA
A network that develops to maximize the mutual information between its output and the signal portion of its input (which is admixed with noise) is useful for extracting salient input features, and may provide a model for aspects of biological neural network function. I describe a local synaptic learning rule that performs stochastic gradient ascent in this information-theoretic quantity, for the case in which the input-output mapping is linear and the input signal and noise are multivariate gaussian. Feedforward connection strengths are modified by a Hebbian rule during a "learning" phase in which examples of input signal plus noise are presented to the network, and by an anti-Hebbian rule during an "unlearning" phase in which examples of noise alone are presented. Each recurrent lateral connection has two values of connection strength, one for each phase; these values are updated by an anti-Hebbian rule.

1 Introduction
The idea of designing a processing stage so as to maximize the mutual information (MI) between its output and the signal portion of its input (which is admixed with noise) is attractive as a way to use sensory input optimally, and to extract statistically salient input features (Linsker 1988; Atick and Redlich 1990). For the idea to be practical for use by biological systems or in large synthetic networks, it is important that the required optimization be implemented by a local algorithm, one that uses only information currently available at the node or connection that is to be modified. This paper presents such an algorithm for the case in which the input-output transformation is linear and the signal and noise distributions are multivariate gaussian. The algorithm performs stochastic gradient ascent in the MI. Local network algorithms have been described for several tasks that differ from the present one but are related to it: (1) principal component analysis (PCA) (Földiák 1989; Leen 1991; Sanger 1989), which identifies high-variance linear combinations of inputs, but does not take account of noise; (2) smoothing and predictive filtering (Atick and Redlich 1991),

Neural Computation 4, 691-702 (1992)
© 1992 Massachusetts Institute of Technology
Figure 1: Linear network, showing feedforward paths (solid lines) and lateral recurrent connections (dashed lines).

which approximate MI maximization in certain limiting cases; and (3) MI maximization in a probabilistic winner-take-all network (Linsker 1989b), a nonlinear case in which only one output node "fires" at any given time, simplifying the computation of the MI. The paper is organized as follows: Section 2 states the optimization problem. The algorithm is presented in Section 3, illustrated by a numerical example in Section 4, and discussed in a broader context in Section 5. Mathematical details are given in the Appendix.

2 The Optimization Problem
A linear feedforward network (indicated by the solid-line paths in Fig. 1) is presented with a sequence of input vectors X. Each vector is the sum of an input signal S and input noise N, where S and N are independently drawn from multivariate gaussian distributions whose means are zero. The network's output is Z ≡ CX + ν, where the noise ν, added at each output, is an independent gaussian random variable of zero mean and nonzero variance. The ensemble-averaged mutual information R(Z, S) between Z and S, also called the Shannon information rate for the S → Z mapping, is the information that the output Z "conveys about" the input signal S. It equals H(Z) - H(Z | S), where H(Z) denotes the entropy of Z, and H(Z | S)
denotes the average over S of the conditional entropy of Z given S. [Note that H(Z | S) equals the entropy of the noise contribution to Z, which is CN + ν.] The entropy of a gaussian distribution having covariance Q is (Shannon and Weaver 1949), apart from an irrelevant constant term,

H = (1/2) ln det Q   (2.1)
Therefore
R(Z, S) = H(Z) - H(Z | S) = (1/2)[ln det Q^L - ln det Q^U]   (2.2)
where

Q^L = ⟨[C(S + N) + ν][C(S + N) + ν]^T⟩ = C q^L C^T + r   (2.3)
Q^U = ⟨(CN + ν)(CN + ν)^T⟩ = C q^U C^T + r   (2.4)

q^L = ⟨(S + N)(S + N)^T⟩ = ⟨SS^T⟩ + ⟨NN^T⟩, q^U = ⟨NN^T⟩, and r = ⟨νν^T⟩.
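As a concrete numerical illustration of equations 2.2-2.4 (our sketch, not the paper's; the dimensions, covariances, and noise level are arbitrary assumptions), R(Z, S) can be evaluated directly for a fixed linear network:

import numpy as np

rng = np.random.default_rng(2)
d_in, d_out = 8, 4

# Arbitrary signal/noise covariances and feedforward matrix (assumptions).
A = rng.normal(size=(d_in, d_in)); q_S = A @ A.T          # <S S^T>
B = rng.normal(size=(d_in, d_in)); q_N = 0.1 * B @ B.T    # <N N^T>
C = rng.normal(size=(d_out, d_in))
r = 0.5 * np.eye(d_out)                                   # output noise <nu nu^T>

q_L, q_U = q_S + q_N, q_N              # L phase: X = S + N; U phase: X = N
Q_L = C @ q_L @ C.T + r                # equation 2.3
Q_U = C @ q_U @ C.T + r                # equation 2.4

# R(Z, S) = (1/2)[ln det Q_L - ln det Q_U]   (equation 2.2, in nats)
R = 0.5 * (np.linalg.slogdet(Q_L)[1] - np.linalg.slogdet(Q_U)[1])
print(f"R(Z, S) = {R:.3f} nats")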
Both S and N may have statistical correlations. Correlations in the input noise N may arise because of sensor design, or because correlations were induced at earlier stages of a multistage processing network, or because N represents environmental features that we do not want the network to respond to (e.g., "semantic noise"). In any case, if the network is to learn to maximize R(Z, S), it must learn to distinguish input signal from noise. We will do this using a training process that consists of two phases. In "learning" or L phase, the network is shown examples of X = S + N. In "unlearning" or U phase, the network is shown examples X = N of input noise alone. Performing gradient ascent on R(Z, S) then consists of alternately performing gradient ascent on H(Z) during L phase, and gradient descent on H(Z | S) during U phase. Each of these tasks is performed, during its respective phase, by the local algorithm described in the next section.¹

3 Local Synaptic Modification Rule
3.1 Performing Gradient Ascent in the Entropy of the Output Activity. We show first how to perform gradient ascent in the entropy H of the output of a linear mapping. (Derivations are relegated to the Appendix.) Let X be an input vector, Y ≡ CX, and Z = Y + ν be the output. X has covariance matrix q = ⟨XX^T⟩, and the output Z has covariance
¹Throughout the paper, the covariance matrix of the input X is denoted by q = ⟨XX^T⟩, and that of the output Z = CX + ν by Q = ⟨ZZ^T⟩. When used, the superscript L or U specifies whether the covariance matrix refers to the "learning" or "unlearning" phase. Since X = S + N during L phase and X = N during U phase, the expressions for q^{L,U} and Q^{L,U} used in equations 2.3 and 2.4 are obtained. Angle brackets denote an ensemble average, and superscript T denotes the transpose.
Q = ⟨ZZ^T⟩ = CqC^T + r. We assume for now that r is independent of C. Using equation 2.1 we obtain

∂H/∂C_ni = (Q⁻¹ C q)_ni   (3.1)
If the Q⁻¹ factor were absent, we would obtain the gradient ascent "batch update" learning rule: ΔC_ni = γ ∂H/∂C_ni = γ(Cq)_ni = γ⟨Y_n X_i⟩. The local Hebbian rule ΔC_ni = γ Y_n X_i would perform stochastic gradient ascent in H. The Q⁻¹ factor, however, introduces a complicated dependence on the activities at all other nodes. To compute equation 3.1 using a local learning rule, we augment the feedforward network that maps X → Y → Z (solid lines of Fig. 1) by adding lateral connections of strength F_nm (dashed lines) from each node m to each node n (including m = n). The lateral connections do not directly affect the network output Z; they are used only to compute the weight changes ΔC. We choose the strength matrix to be
F = I - αQ   (3.2)

(α > 0, and I is the identity matrix) and recursively define a sequence of activity vectors y(t) by
y(0) = Y;   y(t + 1) = Y + F y(t)   (3.3)
If α is chosen so that y(t) converges (see Appendix), then α y(∞) = Q⁻¹ Y and we obtain the batch update learning rule

ΔC_ni = γ ∂H/∂C_ni = γα ⟨y_n(∞) X_i⟩   (3.4)
The Hebbian rule ΔC_ni = γα y_n(∞) X_i, using the iterated activity y(∞) rather than Y, performs stochastic gradient ascent in H. The lateral connection strengths F depend on C through Q. An estimate Q̂ of Q is computed as a running average, or trace, over recent input presentations. The initial Q̂ is chosen arbitrarily, and Q̂_nm is updated at each presentation by

ΔQ̂_nm = (1/M)(Y_n Y_m + r_nm - Q̂_nm)   (3.5)

(If r is not explicitly known, Z_n Z_m may be used in place of Y_n Y_m + r_nm.) We define the strength F_nm = δ_nm - α Q̂_nm. Thus ΔF_nm contains the anti-Hebbian term [-(α/M) Y_n Y_m]. An empirically useful, although not theoretically guaranteed, way to keep α at a proper value for convergence of y(t) is to monitor whether y(t) has "settled down" by a specified time T. For example, define ρ as the sum over all nodes n (optionally averaged over a running window of recent presentations) of [y_n(T+1) - y_n(T)] × [y_n(T) - y_n(T-1)]. If y(T) has nearly converged to y(∞), |ρ| will be smaller than a specified tolerance ε. (The converse is not guaranteed.) If ρ > ε, y(t) is converging too slowly and α should be increased. If ρ < -ε, y(t) is oscillating and α should
be decreased. (An analytic condition for convergence is discussed in the Appendix. That condition cannot directly be used to choose α, since it makes use of the eigenvalues of Q, which are not available to the network.) To summarize, the local algorithm for stochastic gradient ascent in H is: Choose initial C (e.g., randomly) and Q̂ (e.g., = 0). Then do repeatedly the following steps, which we will refer to as "Algorithm A":
1. Select X; compute Y = CX, Z = Y + ν.
2. Update lateral connections: Change Q̂ using equation 3.5.
3. Recursive activation: Compute {y(t), t ≤ T + 1} using equation 3.3 with F = I - αQ̂.
4. Convergence check (e.g., using ρ as above). If α needs to be modified, go back to step 3.
5. Update feedforward connections: Change C_ni by ΔC_ni = γα y_n(T) X_i.
During a start-up period, train Q̂ until it converges, but leave C unchanged, by skipping steps 3-5.
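A compact numerical transcription of Algorithm A might look as follows. This is our sketch rather than the author's code: it replaces the adaptive convergence check of step 4 by a fixed iteration depth T and a fixed α, omits the output noise in step 1, and uses arbitrary sizes and learning parameters.

import numpy as np

def algorithm_A_step(X, C, Q_hat, r, alpha=0.05, gamma=1e-3, M=100, T=16):
    """One presentation of Algorithm A (steps 1-3 and 5; the adaptive
    convergence check of step 4 is replaced by a fixed depth T)."""
    # Step 1: select X; compute Y = C X (output noise nu omitted for brevity).
    Y = C @ X
    # Step 2: update the lateral-connection estimate Q_hat (equation 3.5).
    Q_hat += (np.outer(Y, Y) + r - Q_hat) / M
    # Step 3: recursive activation y(t+1) = Y + F y(t), with F = I - alpha*Q_hat.
    F = np.eye(len(Y)) - alpha * Q_hat
    y = Y.copy()
    for _ in range(T):
        y = Y + F @ y
    # Step 5: Hebbian update of the feedforward weights (equation 3.4).
    C += gamma * alpha * np.outer(y, X)
    return C, Q_hat

# Illustrative use: stochastic gradient ascent in H(Z) on white gaussian inputs.
# (A start-up period that trains Q_hat alone, as in the text, is omitted.)
rng = np.random.default_rng(3)
d_in, d_out = 8, 4
C = 0.1 * rng.normal(size=(d_out, d_in))
Q_hat = np.eye(d_out)
r = 0.1 * np.eye(d_out)
for _ in range(2000):
    C, Q_hat = algorithm_A_step(rng.normal(size=d_in), C, Q_hat, r)

For the full mutual-information rule of equation 3.6 below, the same step would be run alternately on L-phase presentations (X = S + N, weight update applied with a plus sign) and U-phase presentations (X = N, weight update applied with a minus sign), each phase maintaining its own estimate Q̂ and its own α.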
3.2 Gradient Ascent in Mutual Information; "Learning" and "Unlearning" Phases. Combining the two training phases (see equations 2.2 and 3.4), we obtain for the batch update rule

ΔC_ni = γ ∂R(Z, S)/∂C_ni = γ[α^L ⟨y_n(∞; L) X_i⟩_L - α^U ⟨y_n(∞; U) X_i⟩_U]   (3.6)

Here ⟨...⟩_φ denotes an average over a set of presentations appropriate to phase φ, that is, X = S + N for L phase and X = N for U phase. Each iterated activity y_n(∞; φ) is computed using the lateral connection matrix F^φ = I - α^φ Q̂^φ appropriate to that phase. Stochastic gradient ascent in R(Z, S) is obtained by starting with arbitrary C, Q̂^L, and Q̂^U, and repeatedly performing Algorithm A alternately (1) for L phase, updating Q̂^L, α^L (if necessary), and C; and (2) for U phase, updating Q̂^U, α^U (if necessary), and C.

3.3 Constraints and "Resource Costs". In general, maximizing the above R(Z, S) will cause elements of C to increase without limit. Some "resource cost" penalty function P, or some explicit constraint on C, is therefore typically included as part of the optimization problem. Maximizing [R(Z, S) - P] instead of R(Z, S) poses no additional difficulty, provided ∂P/∂C_ni can be computed locally (at each node or connection) by the network. Two useful cases are P ∝ Σ_n (1 - g_n)², where (1) g_n = Σ_i C²_ni or (2) g_n = V_n, and where
V_n ≡ ⟨Y_n²⟩_L = (C q^L C^T)_nn   (3.7)

is the output variance at node n during L phase (before adding output noise ν_n).
3.4 A Constraint on the Number of Discriminable Output Values at Each Node. Realistic nonlinear processing elements typically have limited dynamic range. This limits the number of output values that can be discriminated in the presence of output noise. One way to introduce a similar limitation in our linear network is to penalize deviations from V_n = 1 as above. In this subsection we explore a different way to limit the effective dynamic range of each node. We multiply the output noise at each node n by the factor V_n^{1/2}, so that increasing V_n does not change the number of discriminable output values. That is, Z_n = Y_n + ν_n with ν_n = V_n^{1/2} ν'_n and ⟨ν'_n²⟩ = β (= constant).² When this is done, the variance of each ν_n depends on C (through V_n) and contributes additional terms to the gradient of R(Z, S). We obtain (see Appendix for derivation of equations 3.8 and 3.10)

∂R(Z, S)/∂C_ni = [(Q^L)⁻¹ C q^L]_ni - [(Q^U)⁻¹ C q^U]_ni + β{[(Q^L)⁻¹]_nn - [(Q^U)⁻¹]_nn}(C q^L)_ni   (3.8)

(note the term in β is new). To derive a learning rule for batch update, we must express [(Q^φ)⁻¹]_nn in terms of local quantities. To do this, we recursively compute (during each phase φ)
y'(0; φ) = ν';   y'(t + 1; φ) = ν' + F^φ y'(t; φ)   (3.9)
The prime on y' indicates that the input to the recursion is now ν', rather than CX as before. We find

ΔC_ni = γ[α^L ⟨y_n(∞; L) X_i⟩_L - α^U ⟨y_n(∞; U) X_i⟩_U + (W_n^L - W_n^U)⟨Y_n X_i⟩_L]   (3.10)

where

W_n^φ ≡ α^φ ⟨y'_n(∞; φ) ν'_n⟩ = β[(Q^φ)⁻¹]_nn   (3.11)
To derive a learning rule for update after each presentation, we associate with each node n an estimate or trace Ŵ_n^φ of W_n^φ, obtained by choosing an initial Ŵ^φ, then updating Ŵ_n^φ by ΔŴ_n^φ = (1/M)[α^φ y'_n(∞; φ) ν'_n - Ŵ_n^φ]. The learning rule is then as stated following equation 3.6, with the addition that y'(t; φ) is recursively computed and equations 3.10 and 3.11 are used.

²We could instead have rescaled the output Y_n to unit variance before adding output noise of constant variance β, yielding Z_n = V_n^{-1/2} Y_n + ν'_n. The resulting R(Z, S) is the same in both cases.
4 Illustrative Example
The operation of the local learning rule for update after each presentation in the case of constrained dynamic range (previous subsection) is illustrated by a numerical example. Each input signal vector S is a set of values at D uniformly spaced sites on a one-dimensional "retina" having periodic boundary conditions. S is obtained by convolving white noise with a gaussian filter, so that ⟨S_i S_j⟩ = exp[-(s_ij/s_0)²] (apart from a negligible deviation due to the boundary condition), where s_ij = min(|i - j|, D - |i - j|). The input noise is white; each component N_i of noise vector N is an independent gaussian random variable of mean zero and variance η. Thus ⟨N_i N_j⟩ = η δ_ij. There are D' output nodes n. The initial C_ni values are drawn independently from a uniform distribution with mean zero. Initial Q̂(φ) and Ŵ_n(φ) are set to zero. Parameter values used are D = 16; D' = 8; s_0 = 3; input noise variance η = 0.25; output noise variance β = 0.5; γ = 5 × 10^-?; α^L = 0.445 to 0.475 (automatically adjusted to keep the convergence measure |ρ| < ε for T = 16); α^U = 1; M = 400. C is held fixed for the first 800 input presentations to allow Q̂^φ to converge. During the development of C, a running estimate (trace) of each output variance V_n is used to rescale C_ni (multiplying it by V_n^{-1/2}) so that each V_n remains close to unity. [This rescaling of C for each node n is done for convenience, and has no effect on the value of R(Z, S), as discussed in the previous subsection.] Note that γ was conservatively chosen for slow but accurate convergence of C; no attempt was made to optimize the rate of C development. The resulting weight vector (C_n1, ..., C_nD) for each output node n in general spans the entire "retina." [No penalty for long connections has been included; see Linsker (1989a) for discussion of a similar case.] Since the input covariance matrix is Toeplitz (⟨SS^T⟩_ij is a function of i - j), the eigenvectors of ⟨SS^T⟩ are states (in weight space) having definite spatial frequency k. It is therefore more revealing to show the development of the Fourier components (C̃_n1, ..., C̃_nk, ...) of C for various n, rather than exhibiting the C_ni themselves. Figure 2 shows the squared magnitude |C̃_nk|² for two of the eight nodes n, and the sum of this quantity over all eight nodes, at several stages during training. The summed squared magnitude develops a "bandpass filter" appearance for the following reasons. For each n, C_ni starts as white noise and loses its high spatial frequency components during development, since the input signal is spatially correlated over short distances. The Fourier components of the connection strength also tend to decrease at low spatial frequencies k, where the input signal-to-noise ratio is largest, since devoting a large connection strength to transmitting a high-SNR component would be wasteful of the limited available output variance. Bandpass filter solutions have also been found for similar MI maximization problems, in special cases where the form of C has been restricted to a translationally invariant Ansatz (e.g., C_ni ≡ C[(n/D') - i/D]) [cf. Linsker (1989a) and Atick and Redlich (1990)].
Figure 2: Example of development of the Fourier transform C̃_nk of C_ni for output nodes n, input sites i, and spatial frequencies k. Plotted are |C̃_nk|² (dotted and dashed curves) for two of the eight nodes n, and Σ_n |C̃_nk|² (solid curve), vs. |k|, at (a) start of development (random C); (b,c) two intermediate times; (d) final state to which C has converged. Corresponding values of R(Z, S) are (a) 2.00, (b) 2.33, (c) 2.87, (d) 3.04. See text for parameter values used.

5 Discussion
It is useful and perhaps striking that learning rules constructed from simple Hebbian and anti-Hebbian modification of feedforward and lateral connections can perform gradient ascent in the mutual information to any desired accuracy, with no need for additional network complexity or nonlocal rules. The maximization of mutual information appears to have value for extracting statistical regularities, building "feature" analyzers, and generating fruitful comparisons with biological data. For synthetic networks, the existence of a local algorithm increases the likelihood of feasible and efficient hardware implementations. For biological networks, such an algorithm is crucial if the proposed optimality principle or some variant of it is to be seriously considered as a candidate for a general task that a neural processing stage may learn to perform (Linsker 1988).
5.1 Relation to PCA. The optimization principle is related to PCA in the special case that the input noise covariance ⟨NN^T⟩ = ηI and output noise variance β → 0. Then the D' output nodes develop to span a D'-dimensional leading PCA subspace of the input space. Although the present MI-maximization algorithm is necessarily more complex, it resembles in some respects the PCA algorithm proposed by Földiák (1989). An extension of PCA that is of practical interest for improving image SNR (Green et al. 1988) corresponds to the case in which the input noise N may have arbitrary correlations and one wants to select those components of the input that have maximum SNR in order to reconstruct a less noisy version of the input. The present work provides a local network algorithm suitable for selecting the appropriate components.

5.2 More General Distributions and Mappings. In the case of gaussian-distributed inputs to a linear processing stage, treated here, MI maximization strikes a balance between (1) maximizing the fraction of each node's output variance that reflects variance in the input signal as opposed to the input noise; (2) removing correlations among different nodes' output signal values, thereby reducing redundancy (Barlow 1989), when the output noise variance β is small; and (3) introducing redundancy to mitigate the information-destroying effects of output noise when β is large. The present algorithm can be applied to signal and noise distributions that are not multivariate gaussian, as well as to multistage processing systems in which a nonlinear transformation may be applied between stages. In these more general cases, the algorithm can be expected to strike a balance among the same three goals, although in general the MI will not be thereby maximized (since the MI reflects higher order correlations as well). The present learning algorithm, and extensions thereof, are likely to be well suited to nonlinear multistage networks in which one wishes to analyze low-order statistical correlations at each stage, with the goal of extracting higher-order featural information after a sequence of stages.

5.3 Lateral Connectivity and the Role of Selective "Unlearning". Two features of the lateral connectivity in the present work are of interest. First, the lateral connection strengths F_nm depend on the correlation between output activities. (They may in general also depend on the distance between output nodes; we have omitted this for simplicity.) Second, two phases, "learning" and "unlearning," are used for distinguishing input signal from input noise, both of which may have correlational structure. The lateral connection strength depends on which phase the network is in at a given time. It would be of great interest to know to what extent biological learning rules and network architectures may exhibit qualitatively similar features.
Two-phase learning, with input signal present in one phase and absent in the other, has played a role in some earlier work directed toward other goals.
1. A Boltzmann machine learns to generate activation patterns over a subset of nodes whose probability distributions are as nearly as possible the same during two phases: (i) when the nodes' activities are clamped by the inputs, and (ii) when they are free-running (Hinton and Sejnowski 1983).
2. "G-maximization" produces a single node whose output is maximally different, in the presence of input correlations, from what it would be if its inputs were uncorrelated (Pearlmutter and Hinton 1986).
3. The A_R-P rule for reinforcement learning (Barto and Anandan 1985) uses a rule similar to Hebbian update when the output is "successful," and anti-Hebbian update otherwise. We may draw an analogy between reinforcement learning and the present work on unsupervised learning: "Success" may indicate that the network is responding to combinations of input that are relevant to the intended goal (hence to combinations that constitute input "signal"), while "failure" may indicate that the input combinations that evoked the response constitute "semantic noise." Extensions of the present algorithm along these lines may be applicable to reinforcement learning.
4. Finally, Crick and Mitchison (1983) have suggested that dream sleep may serve a selective "unlearning" role, by suppressing parasitic or otherwise undesired network interactions. The present algorithm offers a concrete example of the utility of such an "unlearning" process in a simple network.

6 Appendix
Derivation of equation 3.1: Since Q is a positive definite symmetric matrix, it can be shown that ln det Q = Tr ln Q, and that the differential quantity dH = (1/2) d(Tr ln Q) = (1/2) Tr(Q⁻¹ dQ). Also, ∂Q_mp/∂C_ni = (Cq)_mi δ_pn + (Cq)_pi δ_mn, where δ denotes the Kronecker delta. Therefore, ∂H/∂C_ni = (Q⁻¹ C q)_ni.

Derivation of equation 3.4: The recursive method is due to Jacobi. Equations 3.2 and 3.3 yield α y(∞) = α Σ_{t=0}^{∞} F^t Y = α(I - F)⁻¹ Y = Q⁻¹ Y. Combining this result with equation 3.1 yields the desired result.

Convergence condition for α (see text following equation 3.5): Let f denote the eigenvalues of F, and let λ± denote the maximum and minimum eigenvalues of Q (0 < λ− < λ+). If α < 2/λ+, then max |f| < 1 and the series (I - F)⁻¹ = Σ_{t=0}^{∞} F^t converges.
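This convergence claim is easy to check numerically; the following few lines (our check; the matrix is a random positive definite example) verify that the iteration of equation 3.3 yields α y(∞) = Q⁻¹ Y:

import numpy as np

rng = np.random.default_rng(4)
n = 5
A = rng.normal(size=(n, n))
Q = A @ A.T + np.eye(n)               # random positive definite Q
Y = rng.normal(size=n)

lam_max = np.linalg.eigvalsh(Q)[-1]
alpha = 1.0 / lam_max                 # satisfies alpha < 2 / lambda_max

F = np.eye(n) - alpha * Q
y = Y.copy()
for _ in range(500):                  # y(t+1) = Y + F y(t)
    y = Y + F @ y

print(np.allclose(alpha * y, np.linalg.solve(Q, Y)))   # alpha*y(inf) = Q^{-1} Y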
For faster convergence, max |f| should be small. When α = 2/(λ+ + λ−), max |f| is minimized and equals (λ+ - λ−)/(λ+ + λ−). The convergence rate is slow when max |f| is close to unity, i.e., when the condition number of Q, λ+/λ−, is large. Note that the variance of the output noise term ν_n can be used to control the condition number: increasing all output noise variances by a constant increases λ+ and λ− by the same amount, hence decreases the condition number and improves convergence.

Derivation of equations 3.8 and 3.10: V_m = (C q^L C^T)_mm yields ∂V_m/∂C_ni = 2(C q^L)_ni δ_mn. We also have

Q^L_mp = (C q^L C^T)_mp + β V_m δ_mp   (A.1)

hence

∂Q^L_mp/∂C_ni = (C q^L)_mi δ_pn + (C q^L)_pi δ_mn + 2β(C q^L)_ni δ_mn δ_pn   (A.2)

and

∂[(1/2) ln det Q^L]/∂C_ni = [(Q^L)⁻¹ C q^L]_ni + β[(Q^L)⁻¹]_nn (C q^L)_ni   (A.3)

Subtracting the corresponding expression involving Q^U yields equation 3.8. Next, using equation 3.9 and ⟨ν'_m ν'_n⟩ = β δ_mn, we obtain

α^φ ⟨y'_n(∞; φ) ν'_n⟩ = α^φ ⟨[Σ_{t=0}^{∞} (F^φ)^t ν']_n ν'_n⟩ = ⟨[(Q^φ)⁻¹ ν']_n ν'_n⟩ = Σ_m [(Q^φ)⁻¹]_nm ⟨ν'_m ν'_n⟩ = β[(Q^φ)⁻¹]_nn   (A.4)

Combining this result with (C q^L)_ni = ⟨Y_n X_i⟩_L yields equation 3.10.
References

Atick, J. J., and Redlich, A. N. 1990. Towards a theory of early visual processing. Neural Comp. 2, 308-320.
Atick, J. J., and Redlich, A. N. 1991. Predicting ganglion and simple cell receptive field organizations. Int. J. Neural Syst. 1, 305-315.
Barlow, H. B. 1989. Unsupervised learning. Neural Comp. 1, 295-311.
Barto, A. G., and Anandan, P. 1985. Pattern-recognizing stochastic learning automata. IEEE Trans. Sys. Man Cybern. 15, 360-375.
Crick, F. H. C., and Mitchison, G. 1983. The function of dream sleep. Nature (London) 304, 111-114.
Földiák, P. 1989. Adaptive network for optimal linear feature extraction. In Proc. IEEE/INNS Intern. Joint Conf. Neural Networks, Washington, DC, Vol. 1, pp. 401-405. IEEE Press, New York.
Green, A. A., Berman, M., Switzer, P., and Craig, M. D. 1988. A transformation for ordering multispectral data in terms of image quality with implications for noise removal. IEEE Trans. Geosci. Remote Sensing 26, 65-74.
Hinton, G. E., and Sejnowski, T. J. 1983. Optimal perceptual inference. Proc. IEEE Conf. Computer Vision, 448-453.
Leen, T. K. 1991. Dynamics of learning in linear feature-discovery networks. Network 2, 85-105.
Linsker, R. 1988. Self-organization in a perceptual network. Computer 21 (March), 105-117.
Linsker, R. 1989a. An application of the principle of maximum information preservation to linear systems. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 186-194. Morgan Kaufmann, San Mateo, CA.
Linsker, R. 1989b. How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Comp. 1, 402-411.
Pearlmutter, B. A., and Hinton, G. E. 1986. G-maximization: An unsupervised learning procedure for discovering regularities. In Neural Networks for Computing, J. S. Denker, ed., pp. 333-338. American Institute of Physics, New York.
Sanger, T. 1989. An optimality principle for unsupervised learning. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 11-19. Morgan Kaufmann, San Mateo, CA.
Shannon, C. E., and Weaver, W. 1949. The Mathematical Theory of Communication. Univ. of Illinois Press, Urbana.
Received 27 December 1991; accepted 28 February 1992.
Communicated by Terrence J. Sejnowski
On the Information Storage Capacity of Local Learning Rules

Günther Palm
Vogt Institute for Brain Research, University of Düsseldorf, D-4000 Düsseldorf, Germany
A simple relation between the storage capacity A for autoassociation and H for heteroassociation with a local learning rule is demonstrated: H = 2A. Both values are bounded by local learning bounds: A ≤ L_A and H ≤ L_H. L_H = 2L_A is evaluated numerically.
1 Introduction

Neural networks with modifiable synaptic connections are now the standard modeling paradigm for learning and associative memory. The recent scientific literature on this subject contains an enormous number of such models, all of them very similar in their basic structure, their rules for synaptic modification, and their qualitative behavior, but different in many details. This paper is concerned with local two-term rules for synaptic modification used in large networks of nonlinear model neurons. We do not compare the different retrieval procedures in detail. For this reason we shall mostly concentrate on the storage procedures, i.e., on the first box in Figure 1. The criterion for a comparison and evaluation of different local synaptic storage procedures or learning rules will be the information storage capacity. This is essentially the channel capacity of the channel depicted in Figure 1, or, as in the next section, of the first box alone. More explicitly, we shall consider the amount of information about the input set S that can be obtained from the storage matrix M. One could agree on calling the capacity of the first box in Figure 1 the "storage capacity," the capacity of the last box the "retrieval capacity," and the capacity of the whole channel the "memory capacity." There are two essentially different cases, namely autoassociative and heteroassociative storage. In the next section we will demonstrate a relationship between the storage capacities for these two cases. Finally, in the last section we shall determine these capacities numerically.

Neural Computation 4, 703-711 (1992)
© 1992 Massachusetts Institute of Technology
Figure 1: The memory channel. Information to be stored S → storage → state of the storage medium M → (passage of time) → retrieval → retrieved information.
2 Capacity of Local Storage Procedures, Heteroassociation versus Autoassociation

The most common synaptic arrangement in the cerebral cortex (and the hippocampus) is the simple dyadic synapse. It connects just two neurons, the presynaptic and the postsynaptic one. Therefore there are just two natural, locally available activity signals: the presynaptic and the postsynaptic activity. Consequently we concentrate on two-term local synaptic rules, i.e., learning rules where the change of synaptic connectivity depends only on two variables x and y, which represent the pre- and postsynaptic activities. We consider two essentially different memory tasks, namely heteroassociation and autoassociation. In autoassociation a set S(n) = {x¹, ..., x^M} of "patterns" (n-vectors) is stored by forming the connectivity matrix M_ij = Σ_{k=1}^{M} R(x_i^k, x_j^k), where the function or rule R(x, y) determines explicitly the amount of synaptic connectivity change for a pair (x, y) of pre- and postsynaptic activity values. In heteroassociation a set S(n, m) = {(x¹, y¹), ..., (x^M, y^M)} of pairs of patterns is stored by forming the matrix M_ij = Σ_k R(x_i^k, y_j^k). The amount of information about S that can be retrieved from M is the storage capacity H(n, m) for a heteroassociative n × m matrix, and A(n)
for an autoassociative n × n matrix (see also Palm 1980, 1982). Section 3 illustrates how this information can actually be obtained. Now the asymptotic storage capacity (per synapse) is defined as the limit A = lim A(n)/n² and H = lim H(n, m)/(m·n). Clearly these two limits depend crucially on the properties of the sets S(n) or S(n, m) that are to be stored. In the framework considered here the sets S are assumed to be randomly generated: each component x_i of each vector x is an independent binary random variable with the same probability q of becoming 1. Similarly the output patterns for heteroassociation are independently generated and p = prob[y_j^k = 1]. In the case of heteroassociation there is only one reasonable retrieval strategy. Given the storage matrix M and one input pattern x^p one can try to obtain an estimate for the corresponding pattern y^p by means of threshold detection in each component. Thus the problem of retrieving the outputs y^p to their corresponding inputs x^p from the matrix can be reduced to the problem of retrieving one component y_j^p of one output pattern y^p from the jth column of the matrix M and the input vector x^p. The simplest and in fact most efficient estimate for y_j^p can be obtained from the inner product of x^p and this column (compare equation 3.1 below). This observation shows that the capacity H(n, m) does not really depend on the second parameter m. We therefore also choose the parameters p(n), q(n), and M(n) defining the sets S(n, m) as depending only on n, not on m. When we want to compare heteroassociation to autoassociation we have in addition to choose p(n) = q(n). One of the essential features of associative memory models is the distributed nature of the memory: different memories are allowed to overlap and thus to affect the same synapses, so that each entry in the synaptic connectivity matrix M may contain the superposition of several memory traces, i.e., for most index pairs i, j the sum Σ_k R(x_i^k, y_j^k) should have more than one nonzero contribution. This also implies a distributed representation of information in the patterns to be stored, i.e., in a usual pattern x there should be more than one nonzero component x_i. This has the important consequence that a nontrivial "distance relation" between different patterns can be defined by the amount of overlap between them, whereas patterns without overlap all have the same maximal distance from each other (see also Palm 1988, 1990). To represent this quest for a distributed representation in an associative memory, we explicitly introduce the following restriction on S:

Distributedness. Most of the patterns x^k and y^k occurring in the set S must have more than one nonzero component.

For stochastic pattern generation distributedness simply means that p(n) > 1/n and q(n) > 1/n. Without this restriction it would be possible to store isolated activity patterns, each in one row (or column) of the memory matrix M, and address them by activity patterns with only one nonzero component in the place of the corresponding row (or column).
Figure 2: The four parameters r₁, r₂, r₃, r₄ describing a local learning rule (a 2 × 2 table indexed by presynaptic and postsynaptic activity).
This would trivially lead to a storage capacity of H = 1, at least in the setup of binary activity vectors, or two-state neurons.
Proposition 0. Without requiring distributedness a heteroassociative storage capacity of H = 1 can be achieved, even with binary synapses.

We now want to study the dependence of A = A(R) or H = H(R) on the choice of the rule R. To simplify the discussion, we restrict ourselves to network models with two-state neurons (as in the spin-glass literature), although many of the subsequent arguments could be extended to more general neuron models. This restriction means that the variables x and y determining the local synaptic change can take only two values, here a and 1, where a is usually either 0 or -1; thus a rule R(x, y) is defined by four numbers (r_1, r_2, r_3, r_4) as in Figure 2. The following two propositions contain some simple observations based on linear algebra and elementary information theory. They are proved in Palm (1988).
Proposition 1. Let N be the set of all local rules R for which A(R) = 0. (1) N is a linear subspace of the space R of all local rules. (2) If R_0 ∈ N, then A(R) = A(R + R_0) and H(R) = H(R + R_0) for any R ∈ R.

Proposition 2. The subspace N of R is three-dimensional. It is spanned by (1, 1, 1, 1), (-1, 1, -1, 1), and (-1, -1, 1, 1). The (up to a constant factor unique) rule C that is orthogonal to N is C = (1, -1, -1, 1). There are two constants A ≠ 0 and H ≠ 0 such that A(R) = A and H(R) = H for every R ∉ N.

Definition. A local rule R is called Hebb-like if (R | C) = r_1 - r_2 - r_3 + r_4 > 0, anti-Hebb if (R | C) < 0, and noninteractive if (R | C) = 0.

In the remainder of this section we demonstrate a simple relation between A and H, namely 2A = H.
Local Learning Rules
707
Before we proceed let us recall the assumptions needed for the next three propositions (3-5). We assume local two-term rules, two-state neurons, stochastic pattern generation, and an arbitrary but fixed choice of the parameters M(n) and p(n) = q(n) as n goes to infinity.

Proposition 3. H(n, i + j) = H(n, i) + H(n, j) and H(i + j, n) ≥ H(i, n) + H(j, n).

Proof. Obvious. Thus

sup_{n,m} H(n, m)/(mn) = lim_m H(n, m)/(mn) = lim_n H(n, m)/(mn)
Proposition 4. lim_n [A(n + 1) - A(n)]/n = H for p(n) = q(n).

Proof. The matrix M realizing A(n + 1) can be decomposed into an n × n autoassociative matrix M', which approximately realizes the storage capacity A(n), two copies of the same vector x (appearing as an n × 1 and a 1 × n matrix), and one number. This decomposition shows that |A(n + 1) - A(n) - H(n, 1)|/n → 0.

Proposition 5. H = 2A for p(n) = q(n).
Proof. By Proposition 4 we can write [A(i) - A(i - 1)]/i = H + e(i), where e(i) goes to zero and thus e(i) < e for i ≥ L. So,

A = lim_{n→∞} A(n)/n^2 = lim Σ_{i=L}^{n} [A(i) - A(i - 1)]/n^2 = lim Σ_{i=L}^{n} [H + e(i)] i/n^2
  = lim H[n(n + 1) - L(L + 1)]/2n^2 + lim Σ_{i=L}^{n} e(i) i/n^2 = H/2
The result of Proposition 5 is intuitively plausible since autoassociation always leads to a symmetric storage matrix, and a symmetric matrix contains roughly half the information of an arbitrary matrix of the same size.

3 Capacity of Heteroassociative Storage and Retrieval, Evaluation of H and A
In this section we discuss certain particular choices of the parameters M(n), p(n), and q(n) for the case of heteroassociation, leading in particular to the optimal choice for fixed p(n) = p. For heteroassociative storage there is only one natural retrieval procedure: Given S = {(x^k, y^k) : k = 1, ..., M} and the matrix M, we take each x^k (k = 1, ..., M) and form x^k M. Then we choose a vector Θ of detection thresholds and form ŷ^k = h(x^k M - Θ), i.e., ŷ_i^k = h[(x^k M - Θ)_i], where h(x) = 1 if x > 0 and h(x) = 0 otherwise.
This vector ŷ^k is taken as an estimate for the desired output vector y^k, and we have to estimate the amount of information that is needed to correct the errors, i.e., the deviation between ŷ^k and y^k for an optimally chosen threshold Θ. The estimate ŷ_j^k for the jth component of the kth output pattern y^k is obtained by thresholding the dendritic sum

(x^k M)_j = Σ_i x_i^k M_{ij} = S + N,  S = Σ_i x_i^k R(x_i^k, y_j^k),  N = Σ_i x_i^k Σ_{l≠k} R(x_i^l, y_j^l)    (3.1)

In this expression, N contains no information about y_j^k and can be regarded as noise, S can be regarded as the signal containing the information on y_j^k, and Θ = Θ_j can be regarded as a detection threshold. Obviously the rule R should be chosen in such a way that S is large for y_j^k = 1 and small for y_j^k = 0. We define the signal-to-noise ratio r for the detection problem as

r := E[Σ_i x_i^k R(x_i^k, 1) - Σ_i x_i^k R(x_i^k, 0)] / σ(N)
where σ(N) denotes the standard deviation of N, given the input x^k. This signal detection problem has been analyzed by several authors (Palm 1988, 1990; Willshaw and Dayan 1990; Nadal and Toulouse 1990) and there is one result that is of particular interest here. It concerns the so-called sparse high-fidelity limit. In this limit the parameters p(n), q(n), and M(n) are chosen in such a way that p(n) and q(n) as well as the error probabilities converge to zero as n goes to infinity. In this limit a capacity of 1/(2 ln 2) can be achieved with optimal threshold detection. Thus H = 1/(2 ln 2) for this choice of parameters p(n), q(n), and M(n). Furthermore the local rule achieving maximal signal-to-noise ratio r can be identified (see Palm 1990; Willshaw and Dayan 1990) and its value for r is given by

r^2 = n/[M p(1 - p)]    (3.2)
From this relation one can immediately see that high fidelity, i.e., r → ∞, together with nonzero capacity can be achieved only if p(1 - p) goes to zero. Here it is reasonable to break the inherent symmetry and require that p → 0. With this knowledge we can also determine H for another extreme choice of parameters: the no-fidelity or error-full case. In this case we take p(n) = p and q(n) = q, both constant, and let M(n) increase to extremely large values, so that r → 0 and both error probabilities converge to 1/2 as n → ∞. Still one can retrieve (at least in principle) some information from the memory matrix. Given p, q, and M, and therefore r, we can estimate the error probabilities by means of the cumulative gaussian distribution G as e = G(-r/2).
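The retrieval procedure and the signal-to-noise picture can be simulated directly. In the hedged sketch below (Python/NumPy; the covariance-type rule, the sizes, and the threshold choice are illustrative assumptions), the empirical signal-to-noise ratio of the dendritic sums is compared with the gaussian error estimate e = G(-r/2):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, m, M_pat, p = 400, 400, 200, 0.05              # illustrative sizes, p = q

X = (rng.random((M_pat, n)) < p).astype(float)
Y = (rng.random((M_pat, m)) < p).astype(float)
rule = lambda pre, post: (pre - p) * (post - p)   # a Hebb-like (covariance) rule
M = sum(rule(x[:, None], y[None, :]) for x, y in zip(X, Y))

k = 0
dend = X[k] @ M                                   # dendritic sums for input x^k
on, off = dend[Y[k] == 1], dend[Y[k] == 0]
theta = 0.5 * (on.mean() + off.mean())            # crude threshold for the demo
y_hat = (dend > theta).astype(float)

r_emp = (on.mean() - off.mean()) / off.std()      # empirical signal-to-noise ratio
print("SNR:", r_emp,
      " predicted error e = G(-r/2):", norm.cdf(-r_emp / 2),
      " observed error:", np.mean(y_hat != Y[k]))
```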
Since e → 1/2 and thus r → 0, we can approximate G(-r/2) as

e = G(-r/2) ≈ 1/2 - r/(2√(2π))    (3.3)

Thus

H = (M/n){I(p) - p_1 I[p(1 - e)/p_1] - p_0 I[pe/p_0]}    (3.4)

Here I(p) = -p log_2 p - (1 - p) log_2(1 - p), p_1 = p(1 - e) + (1 - p)e, and p_0 = pe + (1 - p)(1 - e). The second-order approximation to I(p') around p is

(ln 2) I(p') = (ln 2) I(p) + (p' - p) ln[(1 - p)/p] - (p' - p)^2/[2p(1 - p)]    (3.5)

If we insert equations 3.2, 3.3, and 3.5 into equation 3.4 we obtain

H = 1/(π ln 2)    (3.6)
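The limit in equation 3.6 can be checked numerically. The sketch below (Python; it relies on the reconstruction of equation 3.4 as given above, so treat it as a consistency check under that assumption rather than a definitive implementation) evaluates the per-synapse capacity for increasingly large M/n and arbitrary p:

```python
import numpy as np
from scipy.stats import norm

def I(p):                                  # binary entropy in bits
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def H_per_synapse(p, M_over_n):
    r = np.sqrt(1.0 / (M_over_n * p * (1 - p)))               # equation 3.2
    e = norm.cdf(-r / 2)                                      # e = G(-r/2), exact
    p1 = p * (1 - e) + (1 - p) * e
    p0 = 1 - p1
    T = I(p) - p1 * I(p * (1 - e) / p1) - p0 * I(p * e / p0)  # info per component
    return M_over_n * T                                       # equation 3.4

for ratio in [1e2, 1e4, 1e6]:
    print(f"M/n = {ratio:.0e}: H = {H_per_synapse(0.3, ratio):.4f}")
print("limit 1/(pi ln 2) =", 1 / (np.pi * np.log(2)))  # ~0.459, independent of p
```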
By Proposition 5 we obtain A = 1/(2π ln 2). Interestingly, these values turn out to be independent of p and q. We summarize these observations in the following proposition.

Proposition 6. In the sparse limit p → 0 a capacity of H = 1/(2 ln 2) can be achieved with high-fidelity retrieval. In the nonsparse case, for arbitrary fixed p, a capacity of H = 1/(π ln 2) can always be achieved in the no-fidelity limit.

Of course, it does not make too much sense to let M(n) increase so quickly that both error probabilities go to 1/2. We have analyzed more judicious choices of M(n) numerically, and it turns out that one can always reach values above the no-fidelity limit, although we have never been able to reach the sparse high-fidelity limit. For p = q = 1/2 the no-fidelity limit actually seems to be the optimum.

We finally consider the general case where p(n) = q(n) = p. In the case of heteroassociation it turns out that H does not depend on the choice of q(n), so we can choose q(n) = p(n) for easier comparison to autoassociation. We may define the local learning bound L_h(p) or L_a(p) as the optimal value for H, or A, respectively, that can be achieved for p(n) [= q(n)] = p and the best choice of M(n). Our numerical investigations of this bound suggest the following proposition.

Proposition 7. L_h(p) decreases monotonically from L_h(0) := lim_{p→0} L_h(p) = 1/(2 ln 2) to L_h(1/2) = 1/(π ln 2). By Proposition 5, again L_h = 2L_a.

Our final result as stated in Propositions 6 and 7 is certainly important for a large class of memory models based on local learning rules in neural networks. In particular, the result on autoassociation can be used as an upper bound for the memory capacities that can be achieved with concrete local learning rules and concrete retrieval procedures, like the fixed-point information retrieval capacities for a number of Hopfield-like
spin-glass models (e.g., Hopfield 1982; Amit et al. 1987; Horner 1989; Tsodyks and Feigelman 1988). We have recently been able to determine the function L_h(p) numerically; a plot is provided in Figure 3.

Figure 3: The local learning bound L_h as a function of the density p of ones.
References

Amit, D. J., Gutfreund, H., and Sompolinsky, H. 1987. Information storage in neural networks with low levels of activity. Phys. Rev. A 35, 2293-2303.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Horner, H. 1989. Neural networks with low levels of activity: Ising vs. McCulloch-Pitts neurons. Z. Phys. B 75, 133-136.
Nadal, J. P., and Toulouse, G. 1990. Information storage in sparsely coded memory nets. Network 1, 61-74.
Palm, G. 1980. On associative memory. Biol. Cybern. 36, 19-31.
Palm, G. 1982. Rules for synaptic changes and their relevance for the storage of information in the brain. In Cybernetics and Systems Research, R. Trappl (Ed.), pp. 277-280. North-Holland Publishing Company, Amsterdam.
Palm, G. 1988. On the asymptotic information storage capacity of neural networks. In Neural Computers, C. von der Malsburg and R. Eckmiller (Eds.), pp. 271-280. Springer-Verlag, Berlin.
Palm, G. 1990. Local learning rules and sparse coding in neural networks. In Advanced Neural Computers, R. Eckmiller (Ed.), pp. 145-150. Elsevier Science Publishers B.V., North-Holland.
Tsodyks, M. V., and Feigelman, M. V. 1988. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101-105.
Willshaw, D., and Dayan, P. 1990. Optimal plasticity from matrix memories: What goes up must come down. Neural Comp. 2, 85-93.
Received 11 December 1990; accepted 18 March 1992.
Communicated by Haim Sompolinsky
Learning Curves for Error Minimum and Maximum Likelihood Algorithms

Y. Kabashima and S. Shinomoto
Department of Physics, Kyoto University, Kyoto 606, Japan
For the problem of dividing the space originally partitioned by a blurred boundary, every learning algorithm can make the probability of incorrect prediction of an individual example ε decrease with the number of training examples t. We address here the question of how the asymptotic form of ε(t), as well as its limit of convergence, reflects the choice of learning algorithm. The error minimum algorithm is found to exhibit rather slow convergence of ε(t) to its lower bound ε_0: ε(t) - ε_0 ~ O(t^{-2/3}). Even for the purpose of minimizing prediction error, the maximum likelihood algorithm can be utilized as an alternative. If the true probability distribution happens to be contained in the family of hypothetical functions, then the boundary estimated from the hypothetical distribution function eventually converges to the best choice, and convergence of the prediction error is ε(t) - ε_0 ~ O(t^{-1}). If the true distribution is not available from the algorithm, however, the boundary generally does not converge to the best choice, but instead ε(t) - ε_1 ~ ±O(t^{-1/2}), where ε_1 > ε_0 > 0.
1 Introduction
The original purpose of machine learning is to adjust the machine parameters so as to reproduce the input-output relationship implied by the examples. Learning situations can be classified into two cases depending upon whether or not the machine is in principle able to completely reproduce the individual examples. In the case that the machine is able to reproduce examples, the remaining interest is the estimate of the generalization error: the probability ε of incorrect prediction of a novel example, provided that the machine has succeeded in reproducing t examples. The problem has recently been resolved by two means: computation-theoretic and statistical-mechanical. First, the idea of PAC learning by Valiant (1984) was applied by Baum and Haussler (1989) to the worst-case estimate of the generalization error of neural network models. Second, a statistical mechanical theory for the typical-case estimate of the generalization error was formulated under the Bayes formula by Levin et al. (1990).
Amari et al. (1992) classified the asymptotic scaling forms of the learning curves ε(t) into four types. The statistical theory is not restricted to the case that a machine can reproduce the raw examples. Actually, two of the four types of scaling forms concern the case that the examples shown by the supervisor are more or less noisy.

We take up here the convergence of prediction error for dividing the space originally partitioned by a blurred boundary. The purpose of the learning is not unique in this case; one may seek the probability distribution of the classification, or one may seek the best boundary so as to minimize the prediction error for individual examples. The maximum likelihood algorithm and the error minimum algorithm are the corresponding standard strategies for these motivations. The two strategies are identical if the family of hypothetical distribution functions for the maximum likelihood algorithm is stepwise and symmetrical [see Rissanen (1989); Levin et al. (1990)]. In the case that the hypothetical distribution functions are smooth, however, the two strategies are generally different from each other.

We found that the convergence of the error minimum algorithm is rather slow. In this algorithm, ε(t) converges to the lower bound ε_0 with the asymptotic form ε(t) - ε_0 ~ O(t^{-2/3}). We will explain the source of the fractional exponent 2/3 theoretically. Even for the purpose of minimizing prediction error, we can use the maximum likelihood algorithm as an alternative. In this case, the boundary can be defined as a hypersurface on which the hypothetical probabilities for the alternative classes balance with each other. If the true probability distribution is available from the algorithm, the prediction error converges rapidly as ε(t) - ε_0 ~ O(t^{-1}). In the case that the true distribution is not available from the algorithm, the boundary generally does not converge to the best choice, but ε(t) - ε_1 ~ ±O(t^{-1/2}), where ε_1 > ε_0 > 0.
2 Numerical Simulation
We first show the results of numerical simulation of the following simple partition problem. Every example consists of a real input x ∈ [0, 1] and a binary output s = ±1. The real number x is drawn independently from the uniform distribution over the interval, p(x) = 1. The probability of getting s = ±1 depends on x as p(s = +1 | x) = 0.1 + 0.7x and p(s = -1 | x) = 1 - p(s = +1 | x) (Fig. 1a). We examined the following three strategies for the partition of the interval: (1) the error minimum algorithm, (2) the maximum likelihood algorithm with the family of probability functions q_w(s = +1 | x) = w + 0.7x, and (3) the maximum likelihood algorithm with q_w(s = +1 | x) = w + 0.4x.

The error minimum algorithm seeks the partition that minimizes the total number of the left points with s = +1 and the right points with s = -1.
Figure 1: (a) The original probability distribution p(s = +1 | x) = 0.1 + 0.7x. (b) The best partition ξ_0 = 4/7 for the error minimum. (c-e) The best hypothetical distribution functions for the maximum likelihood with q_w(s = +1 | x) = w + 0.7x, w + 0.4x, and δ + (1 - 2δ)θ(x - w).
As the number of examples increases, the partition point x_o is expected to approach the point ξ_0 = 4/7 at which the probabilities for the alternative classes balance: p(s = +1 | ξ_0) = p(s = -1 | ξ_0) (Fig. 1b). For a given partition at x_o, the probability of incorrect prediction is ε = ε_0 + a(x_o - ξ_0)^2, where ε_0 = 9/28 and a = 0.7. In this algorithm, the possible position of the optimal partition x_o is given by an interval between adjacent sample points, and the error measure has to be averaged over the interval.
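The error minimum algorithm is easy to reproduce. A minimal sketch (Python/NumPy; as a simplification, the averaging over the optimal interval is replaced by placing the boundary at the interval's midpoint, and the number of trials is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def error_min_partition(x, s):
    """Boundary minimizing (left points with s=+1) + (right points with s=-1)."""
    order = np.argsort(x)
    xs, ss = x[order], s[order]
    plus_left = np.cumsum(ss == 1)       # +1's among the first i sorted points
    minus_left = np.cumsum(ss == -1)
    minus_total = minus_left[-1]
    # E_i for cuts after point i = 1..t, preceded by the cut before all points
    E = np.concatenate(([minus_total], plus_left + minus_total - minus_left))
    i = int(np.argmin(E))
    pts = np.concatenate(([0.0], xs, [1.0]))
    return 0.5 * (pts[i] + pts[i + 1])   # midpoint of the optimal interval

def eps(x0):
    """Exact prediction error of boundary x0 for p(s=+1|x) = 0.1 + 0.7x."""
    F = lambda a, b: 0.1 * (b - a) + 0.35 * (b**2 - a**2)  # integral of p(+1|x)
    return F(0.0, x0) + (1.0 - x0) - F(x0, 1.0)

eps0 = eps(4 / 7)                        # = 9/28
for t in [100, 1000, 10000]:
    gap = []
    for _ in range(200):
        x = rng.random(t)
        s = np.where(rng.random(t) < 0.1 + 0.7 * x, 1, -1)
        gap.append(eps(error_min_partition(x, s)) - eps0)
    print(t, np.mean(gap))               # shrinks roughly like t**(-2/3)
```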
The maximum likelihood algorithm seeks the optimal parameter value w_o, which maximizes the likelihood function

L(w) = Π_{(s,x)} q_w(s | x)    (2.1)

where the product runs over the t examples. The original probability distribution is available from algorithm (2), and the optimal parameter w_o is expected to approach 0.1, which minimizes the Kullback divergence. As a result, the optimal partition x_o estimated from q_{w_o}(s = +1 | x_o) = q_{w_o}(s = -1 | x_o) eventually approaches the best choice, ξ_0 = 4/7 (Fig. 1c). On the other hand, algorithm (3) does not contain the true distribution function, and the optimal parameter is expected to approach a value w_1 that minimizes the Kullback divergence. In this case, the optimal partition x_o approaches a point ξ_1 remote from ξ_0 = 4/7 (Fig. 1d). Note again that the maximum likelihood is identical to the error minimum if the family of hypothetical functions is stepwise and symmetrical (Fig. 1e), although this is rather exceptional as a maximum likelihood algorithm.

In the numerical simulation, the three algorithms are carried out to obtain the optimal partition x_o and the prediction error ε for a set of t examples drawn from the distribution p(s | x)p(x). The average of the prediction error ε taken over 1000 sets of examples is plotted in Figure 2. The plots of ε(t) - ε_0 for (1) and (2) exhibit the scaling ε(t) - ε_0 ~ O(t^{-α}), with the exponents α = 0.670 ± 0.004 and α = 1.012 ± 0.008, respectively. The prediction error according to algorithm (3) does not converge to the lower bound ε_0 but to ε_1 (> ε_0). The mean square deviation of ε(t) from ε_1 in case (3) is depicted in Figure 3, where we can see ε(t) - ε_1 ~ ±O(t^{-α}) with the exponent α = 0.520 ± 0.004. These results are examined in the next section.
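For comparison, the maximum likelihood strategy (2) can be sketched as follows (Python/SciPy; the one-dimensional bounded search and the sample size are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
t = 2000
x = rng.random(t)
s = np.where(rng.random(t) < 0.1 + 0.7 * x, 1, -1)

def neg_log_likelihood(w, slope=0.7):
    q = np.clip(w + slope * x, 1e-9, 1 - 1e-9)    # q_w(s = +1 | x)
    return -np.sum(np.where(s == 1, np.log(q), np.log(1 - q)))

w_o = minimize_scalar(neg_log_likelihood, bounds=(-0.2, 0.29),
                      method="bounded").x
x_o = (0.5 - w_o) / 0.7      # boundary where q_w(+1|x) = q_w(-1|x) = 1/2
print(w_o, x_o)              # w_o -> 0.1, x_o -> 4/7 in the well-specified case
```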
3 Theoretical Interpretation
In order to elucidate the nontrivial exponent obtained from the error minimum algorithm (1), we consider here the simpler situation that the examples are arranged at regular intervals of 1/t, assuming the same form for p(s | x). Let each example be indexed by j from 1 to t. The probability of s_j taking the value s = ±1 is given by p(s | x = j/t). The total number of errors for the partition between i and i + 1 is

E_i = Σ_{j=1}^{i} (1 + s_j)/2 + Σ_{j=i+1}^{t} (1 - s_j)/2    (3.1)

The expectation value of the number of errors is estimated as

⟨E_i⟩ ≈ ⟨E_m⟩ + (a/t)(i - m)^2    (3.2)

where m is the best partition, which minimizes the difference of the alternative probabilities, |p(s = +1 | x = m/t) - p(s = -1 | x = m/t)|.
Figure 2: Average of ε(t) - ε_0. (b), (c), and (d) correspond to cases (1), (2), and (3), respectively. The lines for (1) and (2) were drawn from the least-squares fit ε(t) - ε_0 ∝ t^{-α} with the exponents α = 0.670 ± 0.004 and 1.012 ± 0.008, respectively.

On the other hand, the mean square deviation of the difference E_i - E_m is approximated as
ΔE^2 = ⟨(E_i - E_m)^2⟩ - ⟨E_i - E_m⟩^2 ~ |i - m|    (3.3)

This is the result of a "random walk" of E_i (see Fig. 4). Thus the optimal partition i that minimizes the number of errors E_i can fluctuate around m. The order of the deviation is estimated from the balance |ΔE| ~ ⟨E_i - E_m⟩, which implies |i - m| ~ O(t^{2/3}), or

|x_o - ξ_0| ~ O(t^{-1/3})    (3.4)
Figure 3: Average of [ε(t) - ε_0]^2 for (1) and (2), and [ε(t) - ε_1]^2 for (3). They exhibit the scaling t^{-α} with the exponents α = 1.313 ± 0.016, 1.976 ± 0.014, and 1.040 ± 0.007, respectively.

The prediction error is thus estimated as

ε(t) = ⟨E_i⟩/t = ⟨E_m⟩/t + (a/t^2)(i - m)^2 = ε_0 + O(t^{-2/3})    (3.5)
The numerical result of (1) is consistent with this fractional exponent 2/3.

The remaining two asymptotic scaling forms, for the maximum likelihood algorithms (2) and (3), can be explained by the conventional theory. The variance of the maximum likelihood estimator w_o is known to obey the asymptotic scaling

⟨[w_o(t) - w̄]^2⟩ ∝ t^{-1}    (3.6)
Figure 4: Schematic representation of the fluctuation of E_i around ⟨E_i⟩.

where w̄ = lim_{t→∞} ⟨w_o(t)⟩. Thus the deviation of w_o from w̄ is of the order of t^{-1/2}. The deviation of the position of the boundary x_o from ξ = lim_{t→∞} ⟨x_o(t)⟩ is proportional to that of w_o from w̄. In the case that the limit ξ is identical to the best choice, ξ = ξ_0, the prediction error is estimated as
ε(t) - ε_0 ∝ (x_o - ξ_0)^2 = O(t^{-1})    (3.7)

On the other hand, if the limit ξ is remote from ξ_0, the prediction error is

ε(t) ≈ ε_1 + 2c(ξ - ξ_0)(x_o - ξ) + c(x_o - ξ)^2 = ε_1 ± O(t^{-1/2})    (3.8)

where ε_1 = ε_0 + c(ξ - ξ_0)^2 > ε_0. The numerical results of (2) and (3) are, respectively, consistent with the scaling forms of equations 3.7 and 3.8.
It is not so difficult to show that these three types of learning curves are insensitive to the choice of the problem as well as to the dimensionality of the space x. This point will be discussed elsewhere. In this paper we did not take into account the computational complexity of these algorithms, nor the estimation of the error measure ε itself, such as discussed by Haussler (1991). These problems are left to future studies.
Acknowledgments

We thank Shun-ichi Amari, Hideki Asoh, Kenji Yamanishi, and Michael Crair for helpful discussion. The present work is partly supported by a Grant-in-Aid for Scientific Research by the Ministry of Education, Japan, No. 0325105.
References

Amari, S., Fujita, N., and Shinomoto, S. 1992. Four types of learning curves. Neural Comp. 4, 605-618.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Haussler, D. 1991. Decision theoretic generalization of the PAC model for neural net and other learning applications. UCSC-CRL-91-02.
Levin, E., Tishby, N., and Solla, S. A. 1990. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE 78, 1568-1574.
Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27(11), 1134-1142.
Received 16 September 1991; accepted 5 February 1992.
Communicated by John Bridle
The Evidence Framework Applied to Classification Networks

David J. C. MacKay*
Computation and Neural Systems, California Institute of Technology, Pasadena, CA 91125 USA
Three Bayesian ideas are presented for supervised adaptive classifiers. First, it is argued that the output of a classifier should be obtained by marginalizing over the posterior distribution of the parameters; a simple approximation to this integral is proposed and demonstrated. This involves a "moderation" of the most probable classifier's outputs, and yields improved performance. Second, it is demonstrated that the Bayesian framework for model comparison described for regression models in MacKay (1992a,b) can also be applied to classification problems. This framework successfully chooses the magnitude of weight decay terms, and ranks solutions found using different numbers of hidden units. Third, an information-based data selection criterion is derived and demonstrated within this framework.
1 Introduction
A quantitative Bayesian framework has been described for learning of mappings in feedforward networks (MacKay 1992a,b). It was demonstrated that this "evidence" framework could successfully choose the magnitude and type of weight decay terms, and could choose between solutions using different numbers of hidden units. The framework also gives quantified error bars expressing the uncertainty in the network's outputs and its parameters. In MacKay (1992c), information-based objective functions for active learning were discussed within the same framework. These three papers concentrated on interpolation (regression) problems. Neural networks can also be trained to perform classification tasks.¹ This paper will show that the Bayesian framework for model comparison can be applied to these problems too.

*Current address: Darwin College, Cambridge CB3 9EU, U.K.
¹In regression the target variables are real numbers, assumed to include additive errors; in classification the target variables are discrete class labels.
Assume that a set of candidate classification models is fitted to a data set, using standard methods. Three aspects of the use of classifiers can then be distinguished:
1. The individual classification models are used to make predictions about new targets.

2. The alternative models are ranked in the light of the data.

3. The expected utility of alternative new data points is estimated for the purpose of "query learning" or "active data selection."

This paper will present Bayesian ideas for these three tasks. Other aspects of classifier use, such as prediction of generalization ability, are not addressed. First let us review the framework for supervised adaptive classification.

1.1 Derivation of the Objective Function G. The same notation and conventions will be used as in MacKay (1992a,b). Let the data set be D = {x^{(m)}, t_m}, m = 1 ... N. In a classification problem, each target t_m is a binary (0/1) variable [more than two classes can also be handled (Bridle 1989)], and the activity of the output of a classifier is viewed as an estimate of the probability that t = 1. It is assumed that the classification problem is noisy, that is, repeated sampling at the same x would produce different values of t with certain probabilities; those probabilities, as a function of x, are the quantities that a discriminative classifier is intended to model. It is well known that the natural objective function in this case is an information-based distance measure, rather than the sum of squared errors (Bridle 1989; Hinton and Sejnowski 1986; Hopfield 1987; Solla et al. 1988).

A classification model H consists of a specification of its architecture A and the regularizer R for its parameters w. When a classification model's parameters are set to a particular value, the model produces an output y(x; w, A) between 0 and 1, which is viewed as the probability P(t = 1 | x, w, A). The likelihood, i.e., the probability of the data² as a function of w, is then
P(D | w, A) = Π_m y^{t_m} (1 - y)^{1 - t_m} = exp G(D | w, A)

where

G(D | w, A) = Σ_m t_m log y + (1 - t_m) log(1 - y)    (1.1)
²Strictly, this is the probability of {t_m} given {x^{(m)}}, w, A; the density over {x} is not modeled by the "discriminative" classifiers discussed in this paper.
This is the probabilistic motivation for the cross-entropy objective function Σ p log q/p. Now if we assign a prior over alternative parameter vectors w,

P(w | {α_c}, A, R) = exp(-Σ_c α_c E_W^{(c)}) / Z_W    (1.2)
where E_W^{(c)} is a cost function for a subset (c) of the weights and α_c is the associated regularization constant (see MacKay 1992b), we obtain a posterior:

P(w | D, {α_c}, A, R) = exp(-Σ_c α_c E_W^{(c)} + G) / Z_M    (1.3)

where Z_W and Z_M are the appropriate normalizing constants. Thus the identical framework is obtained to that in MacKay (1992b), with -G replacing the term βE_D. Note that in contrast to the framework for regression in MacKay (1992b) there is now no free parameter β and no Z_D(β). If, however, a teacher were to supply probability estimates t instead of binary targets, then a constant equivalent to β would appear, expressing the precision of the teacher's estimates. This constant would correspond to the effective number of observations on which the teacher's opinion is based.

The calculation of the gradient and Hessian of G is as easy as for a quadratic E_D, if the output unit's activation function is the traditional logistic f(a) = 1/(1 + e^{-a}), or the generalized "softmax" in the case of more than two classes (Bridle 1989). The appropriateness of a logistic output function for a classifier is well known; it is the function that converts a log probability ratio a into a probability f(a).

1.1.1 Gradient. If y[x^{(m)}] = f[a(x^{(m)})] as defined above, the gradient of G with respect to the parameters w is

∇G = Σ_m (t_m - y^{(m)}) g^{(m)},  where g^{(m)} = ∂a(x^{(m)})/∂w    (1.4)
1.1.2 Hessian. The Hessian can be analytically evaluated (Bishop 1992), but a useful approximation, neglecting terms in ∂²a/∂w², is

∇∇G ≃ -Σ_m f' g^{(m)} g^{(m)T}    (1.5)

where f' = ∂f/∂a. This approximation is expected to be adequate for the evaluation of error bars, for use in data selection, and for the evaluation of the number of well-determined parameters γ. A more accurate evaluation of the Hessian is probably needed for estimation of the evidence. In this paper's demonstrations, the Hessian is evaluated using second differences, i.e., numerical differentiation of ∇G with respect to w.
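For a model that is linear in its parameters, a = w · φ(x), the objective, its gradient (equation 1.4), and the outer-product Hessian approximation (equation 1.5) take only a few lines. A sketch under that linearity assumption (Python/NumPy; for a multilayer network, g^(m) would instead be obtained by backpropagation):

```python
import numpy as np

def f(a):                                  # logistic output unit
    return 1.0 / (1.0 + np.exp(-a))

def G(w, Phi, t):
    """Log likelihood G = sum_m t_m log y + (1 - t_m) log(1 - y) (eq. 1.1).
    Phi is the N x k design matrix whose rows are g^(m) = phi(x^(m))."""
    y = np.clip(f(Phi @ w), 1e-12, 1 - 1e-12)
    return np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def grad_G(w, Phi, t):
    # Equation 1.4: sum_m (t_m - y^(m)) g^(m)
    return Phi.T @ (t - f(Phi @ w))

def hess_G_approx(w, Phi):
    # Equation 1.5: grad grad G ~ -sum_m f' g^(m) g^(m)T, with f' = y(1 - y)
    y = f(Phi @ w)
    return -(Phi * (y * (1 - y))[:, None]).T @ Phi
```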
1.2 Validity of Approximations. On account of the central limit theorem, we expect the posterior distribution to converge to a set of locally gaussian peaks with increasing quantities of data. However, the quadratic approximation to G is expected to converge more slowly than the quadratic approximation to E_D, the error function for regression models, because (1) G is not a quadratic function even for a linear model [a model for which a = Σ w_i φ_i(x)]: each term in G has the large-scale form of a ramp function; and (2) only inputs that fall in the "bend" of the ramp contribute curvature to G. If we have the opportunity for active data selection, we could improve the convergence of this quadratic approximation by selecting inputs that are expected to contribute maximal curvature. A related data selection criterion is derived in Section 4.

2 Every Classifier Should Have Two Sets of Outputs
Consider a classifier with output y(x; w) = f[a(x; w)]. Assume that we receive data D and infer the posterior probability of the parameters w (i.e., we perform "learning"). Now if we are asked to make predictions with this classifier, it is common for the most probable parameter vector w_MP to be used as the sole representative of the posterior distribution. This strategy seems unwise, however, since there may be regions in input space where the posterior ensemble is very uncertain about what the class is; in such regions the output of the network should be y ≃ 0.5 (assuming equiprobable classes a priori), whereas typically the network with parameters w_MP will give a more extreme, unrepresentative, and overconfident output. The error bars on the parameters should be taken into account when predictions are made. In regression problems it is also important to calculate error bars on outputs, but the problem is more acute in the case of classification because, on account of the nonlinear output, the mean output over the posterior distribution is not equal to the most probable network's output. To obtain an output representative of the posterior ensemble of networks around w_MP, we need to moderate the output of the most probable network in relation to the error bars on w_MP.

Of course this idea of averaging over the hidden parameters is not new: marginalization goes back to Laplace. More recently, and in a context closer to the present one, the same message can be found for example in Spiegelhalter and Lauritzen (1990). But it seems that most practitioners of adaptive classification do not currently use marginalization.

I suggest that any classifier should have two sets of outputs. The first set would give the usual class probabilities corresponding to w_MP, y(x; w_MP); these outputs would be used for learning, i.e., for calculating the error signals for optimization of w_MP. The second set would be the moderated outputs y[x; P(w | D)] = ∫ d^k w y(x; w) P(w | D); these outputs would be used for all other applications, e.g., prediction, evaluation of
test error, and for evaluating the utility of candidate data points (Section 4). Let us now discuss how to calculate the moderated outputs. It will then be demonstrated that these outputs can indeed provide better estimates of class probabilities.

2.1 Calculating the Moderated Outputs. If we assume a locally gaussian posterior probability distribution³ over w = w_MP + Δw, P(w | D) ≃ P(w_MP) exp(-½ Δw^T A Δw), and if we assume that the activation a(x; w) is a locally linear function of w with ∂a/∂w = g, then for any x, the activation a is approximately gaussian distributed:
P(a(x) | D) = Normal(a^MP, s^2) = (1/√(2πs^2)) exp[-(a - a^MP)^2/(2s^2)]    (2.1)

where a^MP = a(x; w_MP) and s^2 = g^T A^{-1} g. This means that the moderated output is
P(t = 1 | x, D) = ψ(a^MP, s^2) ≡ ∫ da f(a) Normal(a^MP, s^2)    (2.2)

This is to be contrasted with the most probable network's output, y(x; w_MP) = f(a^MP). The integral of a sigmoid times a gaussian cannot be solved analytically; here I suggest a simple numerical approximation to it:

ψ(a^MP, s^2) ≃ φ(a^MP, s^2) ≡ f[κ(s) a^MP]    (2.3)

with κ = 1/√(1 + πs^2/8). This approximation is not globally accurate over (a^MP, s^2) (for large s^2 ≫ a the function should tend to an error function, not a logistic), but it breaks down gracefully. The value of κ was chosen so that the approximation has the correct gain at a^MP = 0, as s^2 → ∞. A representative view of this approximation is given in Figure 1, which compares φ and φ' with numerical evaluations of ψ and ψ'. A similar approximation in terms of the error function is suggested in Spiegelhalter and Lauritzen (1990).

If the output is immediately used to make a (0/1) decision, then the use of moderated outputs will make no difference to the performance of the classifier (unless the costs associated with error are asymmetrical), since both functions pass through 0.5 at a^MP = 0. But moderated outputs will make a difference if a more sophisticated penalty function is involved. In the following demonstration the performance of a classifier's outputs is measured by the value of G achieved on a test set.

A model classification problem with two input variables and two possible classes is shown in Figure 2a. Figure 2b illustrates the output of a typical trained network, using its most probable parameter values. Figure 2c shows the moderated outputs of the same network.

³Conditioning variables such as A, R, {α_c} will be omitted in this section, since the emphasis is not on model comparison.
Figure 1: Approximation to the moderated probability. (a) The function ψ(a, s^2), evaluated numerically. In (b) the functions ψ(a, s^2) and φ(a, s^2) defined in the text are shown as a function of a for s^2 = 4. In (c), the difference φ - ψ is shown for the same parameter values. In (d), the breakdown of the approximation is emphasized by showing log ψ' and log φ' (derivatives with respect to a). The errors become significant when a ≫ s.
Notice how the moderated output is similar to the most probable output in regions where the data are dense. In contrast, where the data are sparse, the moderated output becomes significantly less certain than the most probable output; this can be seen by the widening of the contours. Figure 2d shows the correct posterior probability for this problem given the knowledge of the true class densities.

Several hundred neural networks having two inputs, one hidden layer of sigmoid units, and one sigmoid output unit were trained on this problem. During optimization, the second weight decay scheme of MacKay (1992b) was used, with independent decay rates for each of three weight classes: hidden weights, hidden unit biases, and output weights and biases. This corresponds to the prior that models the weights in each class as coming from a gaussian; the scales of the gaussians for different classes are independent and are specified by regularizing constants α_c. Each regularizing constant is optimized on line by intermittently updating it to its most probable value as estimated within the "evidence" framework.

The prediction abilities of a hundred networks using their "most probable" outputs and using the moderated outputs suggested above are compared in Figure 3. It can be seen that the predictions given by the moderated outputs are in nearly all cases superior. The improvement is most substantial for underdetermined networks with relatively poor performance. In a small fraction of the solutions, however, especially among the best solutions, the moderated outputs are found to have slightly but significantly inferior performance.
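A minimal sketch of the moderated output of equations 2.2-2.3, with a Monte Carlo evaluation of the exact integral for comparison (Python/NumPy; the test values of a^MP and s^2 are arbitrary):

```python
import numpy as np

def f(a):
    return 1.0 / (1.0 + np.exp(-a))

def moderated(a_mp, s2):
    """phi(a, s^2) = f(kappa(s) a), kappa = 1/sqrt(1 + pi s^2/8) (eq. 2.3)."""
    return f(a_mp / np.sqrt(1.0 + np.pi * s2 / 8.0))

def moderated_mc(a_mp, s2, n=200_000, seed=0):
    # psi(a, s^2) = integral of f(a) Normal(a_mp, s^2) da, by sampling (eq. 2.2)
    a = np.random.default_rng(seed).normal(a_mp, np.sqrt(s2), n)
    return f(a).mean()

print(moderated(2.0, 4.0), moderated_mc(2.0, 4.0))   # both ~0.78
print(f(2.0))   # most probable output ~0.88: more extreme than the moderated one
```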
Figure 2: Comparison of most probable outputs and moderated outputs. (a) The data set. The data were generated from six circular gaussian distributions, three gaussians for each class. The training sets for the demonstrations use between 100 and 1000 data points drawn from this distribution. (b) (upper right) ”Most probable” output of an eight hidden unit network trained on 100 data points. The contours are equally spaced between 0.0 and 1.0. (c) (lower left) “Moderated” output of the network. Notice that the output becomes less certain compared with the most probable output as the input moves away from regions of high training data density. (d) The true posterior probability, given the class densities that generated the data. The viewpoint is from the upper right corner of (a). In (b,c,d) a common gray scale is used, linear from 0 (dark gray) to 1 (light gray).
3 Evaluating the Evidence

Having established how to use a particular model H = {A, R} with given regularizing constants {α_c} to make predictions, we now turn to the question of model comparison.
Figure 3: Moderation is a good thing! The training set for all the networks contained 300 data points. For each network, the test error of the "most probable" outputs and the "moderated" outputs were evaluated on a test set of 5000 data points. The test error is the value of G. Note that for most solutions, the moderated outputs make better predictions.

As discussed in MacKay (1992a), three levels of inference can be distinguished: parameter estimation, regularization constant determination, and model comparison.⁴ The second two levels of inference both require "Occam's razor"; that is, the solution that best fits the data is not the most plausible model, and we need a way to balance goodness of fit against complexity. Bayesian inference embodies such an Occam's razor automatically.

At the first level, a model H, with given regularizing constants {α_c}, is fitted to the data D. This involves inferring what value the parameters w should probably have. Bayes' rule for this level of inference has the form:

P(w | D, {α_c}, H) = P(D | w, {α_c}, H) P(w | {α_c}, H) / P(D | {α_c}, H)    (3.1)
Throughout this paper this posterior is approximated locally by a gaussian:

P(w | D, {α_c}, H) ≃ (1/Z) exp(-M(w_MP) - ½ Δw^T A Δw)    (3.2)

where Δw = w - w_MP, M(w) = Σ_c α_c E_W^{(c)} - G, and A = ∇∇M.

⁴The use of a specified model to predict the class of a datum can be viewed as the zeroeth level of inference.
At the second level of inference, the regularizing constants are optimized:

P({α_c} | D, H) ∝ P(D | {α_c}, H) P({α_c} | H)    (3.3)

The data-dependent term P(D | {α_c}, H) is the "evidence," the normalizing constant from equation 3.1. The evaluation of this quantity and the optimization of the parameters {α_c} is accomplished using a framework due to Gull and Skilling, discussed in detail in MacKay (1992a,b). Finally, at the third level of inference, the alternative models are compared:

P(H | D) ∝ P(D | H) P(H)    (3.4)
Again, the data's opinion about the alternatives is given by the evidence from the previous level, in this case P(D | H). Omitting the details of the second level of inference, since they are identical to the methods in MacKay (1992b), this demonstration presents the final inferences, the evidence for alternative solutions. The evidence is evaluated within the gaussian approximation from the properties of the "most probable" fit w_MP and the error bars A^{-1}, as described in MacKay (1992a).

Figure 4 shows the test error (calculated using the moderated outputs) of the solutions against the data error, and the "Occam's razor" problem can be seen: the solutions with smallest data error do not generalize best. Figure 5 shows the log evidence for the solutions against the test error, and it can be seen that a moderately good correlation is obtained. The correlation is not perfect. It is speculated that the discrepancy is mainly due to inaccurate evaluation of the evidence under the quadratic approximation, but further study is needed here. Finally, Figure 6 explores the dependence of the correlation between evidence and generalization on the amount of data. It can be seen that the correlation improves as the number of data points in the test set increases.
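The quantity being compared here is a gaussian (Laplace) approximation to the log evidence. A generic sketch of that computation (Python/NumPy; `log_joint` and `hessian` stand for model-specific functions — e.g., log P(D|w) + log P(w|H) and its curvature — and are assumptions of this illustration, not the paper's code):

```python
import numpy as np

def log_evidence_laplace(log_joint, hessian, w_mp):
    """Gaussian approximation to log P(D|H) = log integral P(D|w) P(w|H) dw:
    log P(D|H) ~ log_joint(w_mp) + (k/2) log(2 pi) - (1/2) log det A,
    where A = -grad grad log_joint, evaluated at the most probable w."""
    k = len(w_mp)
    A = -hessian(w_mp)
    sign, logdet = np.linalg.slogdet(A)
    assert sign > 0, "curvature must be positive definite at the mode"
    return log_joint(w_mp) + 0.5 * k * np.log(2 * np.pi) - 0.5 * logdet
```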
4 Active Learning

Assume now that we have the opportunity to select the input x where a future datum will be gathered ("query learning"). Several papers have suggested strategies for this active learning problem; for example, Hwang et al. (1991) propose that samples should be made on and near the current decision boundaries. This strategy and that of Baum (1991) are both human-designed strategies, and it is not clear what objective function, if any, they optimize, nor is it clear how the strategies could be improved. In this paper, as in MacKay (1992c), the philosophy will be to derive a criterion from a defined, sensible objective function that measures how useful a datum is expected to be.
Figure 4: Test error versus data error. This figure illustrates that the task of ranking solutions to the classification problem requires Occam’s razor; the solutions with smallest data error do not generalize best.
Figure 5: Test error versus evidence. Each solution was found using the same training set of N = 300 data points. All solutions in which a symmetry was detected among the hidden units were omitted from this graph because the evidence evaluation for such solutions is unreliable.
Figure 6: Correlation between test error and evidence as the amount of data varies. (a) N = 150 data points. (b) N = 600 data points. (Compare Figure 5, for which N = 300.) For comparison, the number of parameters in a typical (10 hidden unit) network is 41. Note that only about 25% of the data points fall in informative decision regions, so the effective number of data points is smaller in each case; bear in mind also that each data point consists only of one bit. All solutions in which a symmetry was detected among the hidden units were omitted because the evidence evaluation for such solutions is unreliable.

This criterion may then be used as a guide for query learning, or for the alternative scenario of pruning uninformative data points from a large data set.
Evidence Framework Applied to Classification Networks
73 1
how influential a datum will be: some data may convey information about the discriminant over a larger region than others. So we want an objective function that measures the global expected informativeness of a datum. 4.2 Objective Function. This paper will study the "mean marginal information." This objective function was suggested in MacKay (1992~1, and a discussion of why it is probably more desirable than the joint information is given there. To define this objective function, we first have to define a region of interest. (The objective of maximal information gain about the model's parameters without a region of interest would lead us to sample at unsampled extremes of the input space.) Here this region of interest will be defined by a set of representative points x('), u = 1. . . V, with a normalized distribution P, on them. P, can be interpreted as the probability that we will be asked to make a prediction at x('). [The theory could be worked out for the case of a continuous region defined by a density p(x), but the discrete case is preferred since it relates directly to practical implementation.] The marginal entropy of a distribution over w, P(w), at one point x(,) is defined to be
s i ) = yu logy, + (1 - y u ) log(1 - yu)
(4.1)
where y, = y[x(");P(w)]is the average output of the classifier over the ensemble P(w). Under the gaussian approximation for P(w), y, is given by the moderated output (equation 2.2), and may be approximated by @(a:', s!) (equation 2.3). The mean marginal entropy is
The sampling strategy studied here is to maximize the expected change in mean marginal entropy. (Note that our information gain is minus the change in entropy.) 4.3 Estimating Marginal Entropy Changes. Let a measurement be made at x. The result of this measurement is either t = 1 or t = 0. Assuming that our current model, complete with gaussian error bars, is correct, the probability of t = 1 is $[uMp(x),s2(x)] 11 @(aMP, s2). We wish to estimate the average change in marginal entropy of t, at x(") when this measurement is made. This problem can be solved by calculating the joint probability distribution Pjt, t u )of t and f,, then finding the mutual information between the two variables. The four values of P ( t , t,) have the form
P(t = 1, t_u = 1) = ∫∫ da da_u f(a) f(a_u) (1/Z) exp(-½ Δa^T C^{-1} Δa)    (4.3)

where Δa^T = (Δa, Δa_u) and the activations a = a^MP + Δa and a_u = a_u^MP + Δa_u are assumed to have a gaussian distribution with covariance matrix

C = ( s^2      ρ s s_u )
    ( ρ s s_u  s_u^2   )    (4.4)

with ρ s s_u = g^T A^{-1} g_{(u)}. The normalizing constant is Z = 2π s s_u (1 - ρ^2)^{1/2}. The expected change in entropy of t_u is

E(ΔS_M^{(u)} | t) = S[P(t, t_u)] - S[P(t)] - S[P(t_u)]    (4.5)

Notice that this mutual information is symmetric in t and t_u. We can approximate E(ΔS_M^{(u)} | t) by Taylor-expanding P(t, t_u) about independence (ρ = 0). The first-order perturbation to P(t, t_u) introduced by ρ can be written in terms of a single variable c:

P(t = 1, t_u = 1) = P(t = 1) P(t_u = 1) + c
P(t = 1, t_u = 0) = P(t = 1) P(t_u = 0) - c
P(t = 0, t_u = 1) = P(t = 0) P(t_u = 1) - c
P(t = 0, t_u = 0) = P(t = 0) P(t_u = 0) + c    (4.6)
Taylor-expanding equation 4.5, we find

E(ΔS_M^{(u)} | t) ≃ - c^2/2 / [P(t = 1) P(t_u = 1) P(t = 0) P(t_u = 0)]    (4.7)
Finally, we Taylor-expand equation 4.3 so as to obtain the dependence of c on the correlation between the activations. The derivative of P(t = 1, t_u = 1) with respect to ρ at ρ = 0 is

(∂/∂ρ) P(t = 1, t_u = 1) = ∫∫ da da_u f(a) f(a_u) (Δa Δa_u / s s_u) (1/Z) exp(-Δa^2/2s^2 - Δa_u^2/2s_u^2)
                         = s ψ'(a^MP, s^2) s_u ψ'(a_u^MP, s_u^2)

where ψ is the moderated probability defined in equation 2.3 and ψ' denotes ∂ψ/∂a. This yields

c ≃ ρ (∂/∂ρ) P(t = 1, t_u = 1) = g^T A^{-1} g_{(u)} ψ'(a^MP, s^2) ψ'(a_u^MP, s_u^2)    (4.8)
Substituting this into equation 4.7, we find

E(ΔS_M^{(u)} | t) = - (g^T A^{-1} g_{(u)})^2 ψ'(a^MP, s^2)^2 ψ'(a_u^MP, s_u^2)^2 / [2 P(t = 1) P(t_u = 1) P(t = 0) P(t_u = 0)]    (4.9)
Assuming that the approximation ψ ≃ φ ≡ f[κ(s) a^MP] is good, we can numerically approximate ∂ψ(a^MP, s^2)/∂a by κ(s) f'[κ(s) a^MP].⁵ Using f' = f(1 - f) we obtain

E(ΔS_M^{(u)} | t) ≃ - κ(s)^2 κ(s_u)^2 f'[κ(s) a^MP] f'[κ(s_u) a_u^MP] (g^T A^{-1} g_{(u)})^2 / 2    (4.10)
The two f' terms in this expression correspond to the two intuitions that sampling near decision boundaries is informative, and that we are able to gain more information about points of interest if they are near boundaries. The term (g^T A^{-1} g_{(u)})^2 modifies this tendency in accordance with the desiderata.

The expected mean marginal information gain is computed by adding up the ΔS_M^{(u)}'s over the representative points x^{(u)}. The resulting function is plotted on a gray scale in Figure 7, for the network solving the toy problem described in Figure 2. For this demonstration the points of interest x^{(u)} were defined by drawing 100 input points at random from the test set. A striking correlation can be seen between the regions in which the moderated output is uncertain and regions of high expected information gain. In addition the expected information gain tends to increase in regions where the training data were sparse.

Now to the negative aspect of these results. The regions of greatest expected information gain lie outside the region of interest to the right and left; these regions extend in long straight ridges hundreds of units away from the data. This estimation of utility, which reveals the "hyperplanes" underlying the model, seems unreasonable. The utility of points so far from the region of interest, if they occurred, could not really be so high. There are two plausible explanations of this. It may be that the Taylor approximations used to evaluate the mean marginal information are at fault, in particular equation 4.8. Or, as discussed in MacKay (1992c), the problem might arise because the mean marginal information estimates the utility of a point assuming that the model is true; if we assume that the classification surface really can be described in terms of hyperplanes in the input space, then it may be that the greatest torque on those planes can be obtained by sampling away from the core of the data. Comparison of the approximation 4.10 with numerical evaluations of ΔS_M^{(u)} indicates that the approximation is never more than a factor of two wrong. Thus the latter explanation is favored, and we must tentatively conclude that the mean marginal information gain is likely to be most useful only for models well matched to the real world.

⁵This approximation becomes inaccurate where a^MP ≫ s ≫ 1 (see Fig. 1c). Because of this it might be wise to use numerical integration to implement ΔS_M^{(u)} in look-up tables.
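Equation 4.10 is straightforward to evaluate once the error-bar quantities are available. A hedged sketch (Python/NumPy; the inputs — activations, variances, and the cross terms g^T A^{-1} g_(u) — are assumed to have been computed elsewhere, e.g., with the Hessian approximation of Section 1):

```python
import numpy as np

def kappa(s2):
    return 1.0 / np.sqrt(1.0 + np.pi * s2 / 8.0)

def fprime(a):                       # f' = f(1 - f) for the logistic f
    y = 1.0 / (1.0 + np.exp(-a))
    return y * (1.0 - y)

def mean_marginal_info_gain(a_mp, s2, a_u, s2_u, cross, P_u):
    """Expected mean marginal information gain of a measurement at one
    candidate input x: minus the weighted sum over points of interest u
    of the expected entropy changes of equation 4.10.

    a_mp, s2 : activation and its variance at the candidate input
    a_u, s2_u: arrays over the points of interest x^(u)
    cross    : array of cross terms g^T A^{-1} g_(u)
    P_u      : normalized weights of the points of interest
    """
    dS = -(kappa(s2)**2 * kappa(s2_u)**2
           * fprime(kappa(s2) * a_mp) * fprime(kappa(s2_u) * a_u)
           * cross**2 / 2.0)
    return -np.sum(P_u * dS)         # information gain = minus entropy change
```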
Figure 7: Demonstration of expected mean marginal information gain. The mean marginal information gain was computed for the network demonstrated in Figure 2b,c. The region of interest was defined by 100 data points from the test set. The gray level represents the utility of a single observation as a function of where it is made. The darkest regions are expected to yield little information, and white corresponds to large expected information gain. The contours that are superposed represent the moderated output of the network, as shown in Figure 2c. The mean marginal information gain is quantified: the gray scale is linear from 0 to 0.0025 nats.
5 Discussion

5.1 Moderated Outputs. The idea of moderating the outputs of a classifier in accordance with the uncertainty of its parameters should have wide applicability, for example, to hidden Markov models for speech recognition. Moderation should be especially important where a classifier is expected to extrapolate to points outside the training region. There is presumably a relationship of this concept to the work of Seung et al. (1991) on generalization "at nonzero temperature."
If the suggested approximation to the moderated output and its derivative is found unsatisfactory, a simple brute force solution would be to set up a look-up table of values of ψ(a, s^2) and ψ'(a, s^2). It is likely that an implementation of marginalization that will scale up well to large problems will involve Monte Carlo methods (Neal 1992).

5.2 Evidence. The evidence has been found to be well correlated with generalization ability. This depends on having a sufficiently large amount of data. There remain open questions, including what the theoretical relationship between the evidence and generalization ability is, how large the data set must be for the two to be well correlated, how well these calculations will scale up to larger problems, and when the quadratic approximation for the evidence breaks down.

5.3 Mean Marginal Information Gain. This objective function was derived with active learning in mind. It could also be used for selection of a subset of a large quantity of data, as a filter to weed out fractions of the data that are unlikely to be informative. Unlike Plutowski and White's (1991) approach, this filter depends only on the input variables in the candidate data. A strategy that selectively omits data on the basis of their output values would violate the likelihood principle and risk leading to inconsistent inferences.

A comparison of the mean marginal information gain in Figure 7 with the contours of the most probable network's output in Figure 2b indicates that this proposed data selection criterion offers some improvements over the simple strategy of just sampling on and near decision boundaries: the mean marginal information gain shows a plausible preference for samples in regions where the decision boundary is uncertain. On the other hand, this criterion may give artifacts when applied to models that are poorly matched to the real world. How useful the mean marginal information gain will be for real applications remains an open question.
Acknowledgments

This work was supported by a Caltech Fellowship and a Studentship from SERC, UK.

References

Baum, E. B. 1991. Neural net algorithms that learn in polynomial time from examples and queries. IEEE Trans. Neural Networks 2(1), 5-19.
Bishop, C. M. 1992. Exact calculation of the Hessian matrix for the multilayer perceptron. Neural Comp. 4(4), 494-501.
Bridle, J. S. 1989. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing: Algorithms, Architectures and Applications, F. Fougelman-Soulie and J. Hérault, eds., pp. 227-236. Springer-Verlag, Berlin.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing, Rumelhart et al., eds., pp. 282-317. The MIT Press, Cambridge.
Hopfield, J. J. 1987. Learning algorithms and probability distributions in feed-forward and feed-back networks. Proc. Natl. Acad. Sci. U.S.A. 84, 8429-8433.
Hwang, J-N., Choi, J. J., Oh, S., and Marks, R. J., II 1991. Query-based learning applied to partially trained multilayer perceptrons. IEEE Trans. Neural Networks 2(1), 131-136.
MacKay, D. J. C. 1992a. Bayesian interpolation. Neural Comp. 4(3), 415-447.
MacKay, D. J. C. 1992b. A practical Bayesian framework for backprop networks. Neural Comp. 4(3), 448-472.
MacKay, D. J. C. 1992c. Information-based objective functions for active data selection. Neural Comp. 4(4), 589-603.
Neal, R. M. 1992. Bayesian training of backpropagation networks by the Hybrid Monte Carlo method. University of Toronto CRG-TR-92-1.
Plutowski, M., and White, H. 1991. Active selection of training examples for network learning in noiseless environments. Dept. Computer Science, UCSD, TR 90-011.
Seung, H. S., Sompolinsky, H., and Tishby, N. 1991. Statistical mechanics of learning from examples. Preprint, Racah Institute of Physics, Israel.
Solla, S. A., Levin, E., and Fleisher, M. 1988. Accelerated learning in layered neural networks. Complex Syst. 2, 625-640.
Spiegelhalter, D. J., and Lauritzen, S. L. 1990. Sequential updating of conditional probabilities on directed graphical structures. Networks 20, 579-605.
Received 20 November 1991; accepted 18 February 1992.
Communicated by Fernando Pineda
Rotor Neurons: Basic Formalism and Dynamics

Lars Gislén, Carsten Peterson, Bo Söderberg
Department of Theoretical Physics, University of Lund, Sölvegatan 14A, S-22362 Lund, Sweden
Rotor neurons are introduced to encode states living on the surface of a sphere in D dimensions. Such rotors can be regarded as continuous generalizations of binary (Ising) neurons. The corresponding mean field equations are derived, and phase transition properties based on linearized dynamics are given. The power of this approach is illustrated with an optimization problem: placing N identical charges on a sphere such that the overall repulsive energy is minimized. The rotor approach appears superior to other methods for this problem, both with respect to solution quality and to the computational effort needed.

1 Background
Standard McCulloch-Pitts neurons are characterized by sigmoidal updating equations

v_i = g(u_i) = tanh(u_i)    (1.1)

where the local field u_i is given by

u_i = (1/T) Σ_j w_ij v_j    (1.2)

and the inverse temperature 1/T sets the gain. The neurons are binary in the high gain (T → 0) limit. In feedback networks (Hopfield and Tank 1985) with a quadratic energy function in terms of binary neurons s_i,

E = -(1/2) Σ_{i,j} w_ij s_i s_j    (1.3)

the iterative solutions of the mean field equations (equations 1.1 and 1.2) represent approximate minima of E for appropriate values of T, where v_i = <s_i>_T. In the more general case of an arbitrary energy function, one has

u_i = -(1/T) ∂E/∂v_i    (1.4)
In a series of papers, we have investigated the generalization of this approach to multistate (Potts) neurons, which are superior in situations where one wants only one of the s_i (i = 1, ..., q) to be "on" and the others "off." In effect, equation 1.1 is replaced by¹

v_i = e^{u_i} / Σ_{j=1}^{q} e^{u_j}    (1.5)

¹Using a [0,1] representation rather than the [-1,1] of equation 1.1.
Such a constrained encoding turns out to be crucial for many optimization applications (Peterson and Söderberg 1989; Peterson 1990a). Potts neurons also play a crucial role in deformable templates or so-called elastic net methods (Durbin and Willshaw 1987). In feedforward networks with exclusive classifications, Potts neuron encoding of the output layer could be profitable (Lönnblad et al. 1991). In the present paper we investigate the generalization from binary neurons to the case of a continuum of states on a D-dimensional sphere, and apply the method to the problem of optimal configuration of charges on a sphere.

2 Rotor Neurons
Consider the general problem of minimizing an energy function E(s_1, ..., s_N) with respect to a set of N D-dimensional unit vectors s_i (hereafter denoted rotors),

|s_i| = 1    (2.1)

A locally minimal configuration must satisfy

s_i = -∇_i E / |∇_i E|    (2.2)
Local optimization consists of iterating these equations until convergence. This is in general not a very good method for finding the global minimum: the configurations easily get stuck in local minima. A more careful method is simulated annealing (Kirkpatrick et al. 1983), where a thermal distribution ∝ exp(-E/T) is simulated. The temperature T is very slowly lowered, until a stable state results. This method is very time-consuming if a good result is desired. For this kind of problem, we suggest a mean field theory (MFT) rotor approach analogous to what is used for combinatorial optimization problems (Hopfield and Tank 1985; Peterson and Söderberg 1989).

2.1 Mean Field Equations. Consider a thermal distribution of configurations, characterized by the partition function Z,

Z = ∫ e^{-E[s]/T} ds_1 ... ds_N    (2.3)
where the simplified notation ds_i means that the integration is to be performed only over the direction of s_i, and normalized such that ∫ ds_i = 1. For simplicity, consider first a single integral I = ∫ H(s) ds, over the directions of a D-dimensional unit vector s, with H an arbitrary function. It can be rewritten as

∫ H(s) ds = ∫ H(v) δ(s - v) ds dv ∝ ∫ H(v) e^{u·(s-v)} ds dv du    (2.4)
Performing the s integral, one is left with

I ∝ ∫ H(v) e^{-v·u + F(u)} dv du    (2.5)

where u = |u| and F(u) is defined by

F(u) = log ∫ e^{u·s} ds    (2.6)

For a D-dimensional integral, this evaluates to

F(u) = log [ Γ(D/2) (u/2)^{1-D/2} I_{D/2-1}(u) ]    (2.7)

where I_ν are modified Bessel functions. Repeating this trick N times with the multiple integral Z, we obtain
Z ∝ ∫ exp [ -E[v]/T - Σ_i v_i·u_i + Σ_i F(u_i) ] dv_1 du_1 ... dv_N du_N    (2.8)
Next, we seek a saddlepoint of the effective potential appearing in the argument of the exponent in the integrand of equation 2.8, by demanding that its derivatives with respect to u_i and v_i vanish. This results in the mean field equations

u_i = -(1/T) ∇_i E(v)    (2.9)

v_i = û_i g(u_i)    (2.10)

where û_i = u_i/u_i and g(u) = F'(u). They give v_i as the average of s_i in the local field ∇_i E(v). In Table 1, g(u) is listed for different values of D. When D = 1, equation 1.1 is recovered. The corresponding curves are shown in Figure 1. In the large D limit, the shape of g is given by

lim_{D→∞} g(Du) = (√(4u² + 1) - 1) / 2u    (2.11)

We regard this system as a neural network, with v_i as a generalized neuron, u_i as its input (local field), and g as a generalized sigmoid function. The obvious dynamics consists in iterating equations 2.9 and 2.10. The performance of the neural network thus defined will depend on T; this is discussed next.
Table 1: Proper g(u) for Different Dimensions D.

D        g(u)
1        tanh(u)
2        I_1(u)/I_0(u)
3        coth(u) - 1/u
general  I_{D/2}(u)/I_{D/2-1}(u)
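The entries above follow from equation 2.7, since g(u) = F'(u) = I_{D/2}(u)/I_{D/2-1}(u). A minimal numerical check of this identity, assuming SciPy's modified Bessel function iv:

```python
import numpy as np
from scipy.special import iv  # modified Bessel function I_nu(u)

def g(u, D):
    """Generalized sigmoid g(u) = I_{D/2}(u) / I_{D/2-1}(u) (cf. equation 2.7)."""
    return iv(D / 2.0, u) / iv(D / 2.0 - 1.0, u)

u = np.linspace(0.01, 10.0, 200)
# D = 1 recovers tanh(u); D = 3 gives the Langevin function coth(u) - 1/u.
assert np.allclose(g(u, 1), np.tanh(u))
assert np.allclose(g(u, 3), 1.0 / np.tanh(u) - 1.0 / u)
```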
Figure 1: Graphs of g(u) for different dimensions D.

2.2 Critical Temperature Estimation. From equation 2.10 we infer that for small u_i, v_i ≈ u_i/D. Making the simplifying assumption that E is rotationally invariant, we can approximate E for small v with a quadratic form

E = E_0 - (1/2) Σ_{i,j} w_ij v_i·v_j + O(v⁴)    (2.12)

But then v_1 = ... = v_N = 0 is a fixpoint of the updating equations 2.9 and 2.10. Linearizing these for small v, we obtain

v_i = (1/(TD)) Σ_j w_ij v_j    (2.13)
Figure 2: Schematic evolution of a D = 2 rotor initialized close to the center, for T < T_c.

If the temperature is higher than the critical temperature

T_c = (1/D) max(λ_max, -λ_min)    (2.14)

where λ_min/max are the extreme eigenvalues of w, this trivial fixpoint is stable under synchronous updating, and the system is in a symmetric phase. For a lower T, it becomes unstable, and the mean fields v_i will be repelled by the origin. For low enough temperature they will stabilize close to the sphere v_i² = 1 (cf. Fig. 2). The dynamics is thus very different from that of local optimization and simulated annealing, where moves take place on the surface. For serial updating, things are similar, although T_c is different. In the special case of a constant self-coupling w_ii = β, we have instead

T_c = (1/D) max(λ_max, -β)    (2.15)
Thus, for a large set of energy functions, we can estimate T_c in advance. A good strategy is then to initialize the system close to the symmetry point, with T close to T_c, and slowly anneal while the system settles. When finally a stable state is reached, a possible solution to the minimization problem is extracted.
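For a concrete energy with a known quadratic expansion, the synchronous-updating estimate of equation 2.14 (as reconstructed above) is a one-liner; the random symmetric coupling matrix in this sketch is purely illustrative.

```python
import numpy as np

def critical_temperature(w, D):
    # T_c = max(lambda_max, -lambda_min) / D for the symmetric coupling
    # matrix w of the quadratic approximation (equation 2.14).
    lam = np.linalg.eigvalsh(w)
    return max(lam[-1], -lam[0]) / D

rng = np.random.default_rng(0)
a = rng.normal(size=(20, 20))
w = (a + a.T) / 2.0  # illustrative symmetric couplings
print(critical_temperature(w, D=3))
```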
For both types of updating, it turns out to be advantageous to be able to adjust the self-coupling to achieve maximum stability. This is done by adding a term -(β/2) Σ_i v_i² to the energy. For a more detailed discussion of β and T_c, and of serial versus synchronous updating, the reader is referred to Peterson and Söderberg (1989).

3 Placing Charges on a Sphere
We now turn to the specific problem of the equilibrium configuration of N equal charges on a sphere (D = 3). With Coulomb interaction between the charges, the energy function is given by

E = Σ_{i<j} 1/|v_i - v_j| - (β/2) Σ_i v_i²    (3.1)

where we for dynamic reasons have added a β term as discussed above. With this energy function the local field u_i (cf. equation 2.9) is given by

u_i = (1/T) [ Σ_{j≠i} (v_i - v_j)/|v_i - v_j|³ + β v_i ]    (3.2)

with the corresponding updating equation (see Table 1)

v_i = û_i (coth(u_i) - 1/u_i)    (3.3)
The critical temperature T_c for this system in serial updating mode is, for reasonable values (cf. equation 2.14),

T_c = (β + 3)/3    (3.4)
The β-term also affects the low temperature behavior, controlling the tendency for the system to remain in a local minimum. A good final configuration, with the charges uniformly distributed over the sphere, should be stable. This means that the updated u_i's should satisfy

u_i(t) · v_i(t-1) > 0    (3.5)

A necessary condition for accomplishing this can be derived for large N; the result is a lower bound β_0 on the self-coupling (equation 3.6). The role of β is thus twofold: it controls the phase transition and the dynamic behavior at low temperatures. Equipped with prior estimates of T_c and β_0, the algorithm for a given problem size can take the following "black box" form:

1. Compute T_c and β_0 according to equations 3.4 and 3.6.
Figure 3: Time consumption as a function of problem size (N) for the MFT rotor (MFT), gradient descent (GD), and simulated annealing (SA) algorithms. Both axes are logarithmic. The three lines correspond to N, N², and N³, respectively.
2. Initialize with v_i = 0.01 · rand[-1, 1).

3. Update all v_i's in turn according to equations 3.2 and 3.3.

4. Decrease the temperature, T → 0.95 T.

5. Repeat from step 3, until the saturation (Peterson and Söderberg 1989) Σ_i v_i²/N > 0.99.
6. Extract the configuration by setting s_i = v̂_i.

Using this prescription we have computed configurations of 3, 5, 10, 20, 30, and 100 charges, respectively. In Figure 3 the time consumption as a function of problem size is shown. As in the case of other MFT applications (Peterson and Söderberg 1989; Gislén et al. 1989, 1991), the number of iterations needed for convergence empirically shows no dependence on problem size. Hence, the time consumption scales as N² for the MFT rotor algorithm. As for the quality of the solutions, the MFT rotor model gives the correct solutions where these are explicitly known (N = 2, 3, 4, and 6). For larger problems we have compared the solutions with those from a gradient descent (GD) and a simulated annealing (SA) algorithm. In GD one charge at a time is moved on the sphere a step ∝ -gradient/N. In SA, a maximum step size ∝ 1/N was used, with the annealing schedule T → 0.95 T. These algorithms were clocked when the energy was within 1% of the MFT rotor result. The time consumption to achieve this is shown in Figure 3.
Figure 4: Evolution of the rotors for an N = 32 problem as T decreases. Open and filled dots represent charges placed at the front and the back of the sphere, respectively. The two graphs are generated to provide a stereo effect.

In Figure 4 the evolution of the rotors for an N = 32 problem is shown. Comparing the MFT rotor approach with the conventional ones, we find that for the MFT rotor algorithm the number of sweeps needed to reach a satisfactory solution is practically independent of problem size, while for the other methods it is (with optimal step size) roughly proportional to the problem size. As for quality, the final energies obtained by the MFT rotor approach were always equal to or lower than those obtained with the other approaches. We have also run the MFT rotor algorithm for a D = 3 system in which we substituted the appropriate sigmoid in equation 3.3 with a properly scaled tanh function
v_i = û_i tanh(u_i/3)    (3.7)
We find that the algorithm performs as well (if not better) with respect to the number of sweeps needed to obtain a good solution. The reason for this investigation is that this sigmoid is more natural in a VLSI implementation.
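For readers who want to reproduce the charge-placement experiment, the following is a minimal sketch of the "black box" prescription of Section 3, using equations 3.2-3.4 as reconstructed above. The sweep cap, the numerical guards, and the choice β = 2 are illustrative assumptions (the paper prescribes β_0 via equation 3.6).

```python
import numpy as np

def mft_rotor_charges(N, beta=2.0, seed=0, max_sweeps=500):
    """MFT rotor sketch for N charges on a D = 3 sphere."""
    rng = np.random.default_rng(seed)
    v = 0.01 * rng.uniform(-1.0, 1.0, size=(N, 3))   # step 2: near the origin
    T = (beta + 3.0) / 3.0                           # step 1: T_c, equation 3.4
    for _ in range(max_sweeps):
        if np.sum(v * v) / N > 0.99:                 # step 5: saturation test
            break
        for i in range(N):                           # step 3: serial sweep
            diff = v[i] - np.delete(v, i, axis=0)
            r3 = np.sum(diff ** 2, axis=1) ** 1.5 + 1e-12
            force = np.sum(diff / r3[:, None], axis=0)
            u = (force + beta * v[i]) / T            # local field, equation 3.2
            un = np.linalg.norm(u) + 1e-12
            v[i] = (u / un) * (1.0 / np.tanh(un) - 1.0 / un)  # equation 3.3
        T *= 0.95                                    # step 4: anneal
    return v / np.linalg.norm(v, axis=1, keepdims=True)      # step 6: s_i

s = mft_rotor_charges(10)  # ten charges spread over the unit sphere
```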
4 Summary

The formalism and dynamics for D-dimensional feedback rotor neurons have been developed. For D = 1 one recovers the conventional sigmoidal updating equations. As a first test bed for this approach in higher dimensions, we applied the method to the problem of finding the optimal charge configuration on a D = 3 sphere. The performance of the rotor method appears to be superior to that of gradient descent and simulated annealing for this problem.
Other potential problem areas of more practical use are, e.g., curve detection in the early vision system (Zucker et al. 1990), or the reconstruction of tracks from signals (Peterson 1990b). The D > 1 updating equations can of course also be used in feedforward multilayered networks.
References

Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689.
Gislén, L., Peterson, C., and Söderberg, B. 1989. Teachers and classes with neural networks. Int. J. Neural Syst. 1, 3.
Gislén, L., Peterson, C., and Söderberg, B. 1991. Scheduling high schools with neural networks. Lund University preprint LU TP 91-9 (to appear in Neural Comp.).
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybern. 52, 141.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671.
Lönnblad, L., Peterson, C., and Rögnvaldsson, T. 1991. Using neural networks to identify jets. Nuclear Phys. B 349, 675.
Peterson, C. 1990a. Parallel distributed approaches to combinatorial optimization. Neural Comp. 2, 261.
Peterson, C. 1990b. Neural networks and high energy physics. In Proceedings of International Workshop on Software Engineering, Artificial Intelligence and Expert Systems for High Energy and Nuclear Physics, Lyon Villeurbanne, France, March 1990, D. Perret-Gallix and W. Wojcik, eds. Editions du CNRS, Paris.
Peterson, C., and Söderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1, 3.
Zucker, S. W., Dobbins, A., and Iverson, L. 1990. Two stages of curve detection suggest two styles of visual computation. Neural Comp. 1, 68.
Received 17 September 1991; accepted 27 January 1992.
Communicated by Michael Jordan
Refining PID Controllers Using Neural Networks

Gary M. Scott
Department of Chemical Engineering, University of Wisconsin, Madison, WI 53706 USA

Jude W. Shavlik
Department of Computer Sciences, University of Wisconsin, Madison, WI 53706 USA

W. Harmon Ray
Department of Chemical Engineering, University of Wisconsin, Madison, WI 53706 USA
The KBANN (Knowledge-Based Artificial Neural Networks) approach uses neural networks to refine knowledge that can be written in the form of simple propositional rules. We extend this idea further by presenting the MANNCON (Multivariable Artificial Neural Network Control) algorithm, by which the mathematical equations governing a PID (Proportional-Integral-Derivative) controller determine the topology and initial weights of a network, which is further trained using backpropagation. We apply this method to the task of controlling the outflow and temperature of a water tank, producing statistically significant gains in accuracy over both a standard neural network approach and a nonlearning PID controller. Furthermore, using the PID knowledge to initialize the weights of the network produces statistically less variation in testset accuracy when compared to networks initialized with small random numbers.

1 Introduction
Research into the design of neural networks for process control has largely ignored existing knowledge about the task at hand. One form this knowledge (often called the "domain theory") can take is embodied in traditional controller paradigms. The recently developed KBANN approach (Towell et al. 1990) addresses this issue for tasks for which a domain theory (written using simple, nonrecursive propositional rules) is available. The basis of this approach is to use the existing knowledge to determine an appropriate network topology and initial weights, such that the network begins its learning process at a "good" starting point.
This paper describes the MANNCON algorithm, a method of using a traditional controller paradigm to determine the topology and initial weights of a network. The use of a PID controller in this way eliminates network-design problems such as the choice of network topology (i.e., the number of hidden units) and reduces the sensitivity of the network to the initial values of the weights. Furthermore, the initial configuration of the network is closer to its final state than it would normally be in a randomly configured network. Thus, the MANNCON networks perform better and more consistently than the standard, randomly initialized three-layer approach. The task we examine here is learning to control a nonlinear Multiple-Input, Multiple-Output (MIMO) system. There are a number of reasons to investigate this task using neural networks. First, many processes involve nonlinear input-output relationships, which matches the nonlinear nature of neural networks. Second, there have been a number of successful applications of neural networks to this task (Bhat and McAvoy 1990; Jordan and Jacobs 1990; Miller et al. 1990). Finally, there are a number of existing controller paradigms that can be used to determine the topology and the initial weights of the network. The next sections introduce the MANNCON algorithm and describe an experiment that involves controlling the temperature and outflow of a water tank. The results of our experiment show that our network, designed using an existing controller paradigm, performs significantly better (and with significantly less variation) than a standard, three-layer network on the same task. The concluding sections describe some related work in the area of neural networks in control and some directions for future work on this algorithm. In the course of this article, we use many symbols, not only in defining the topology of the network, but also in describing the physical system and the PID controller. Table 1 defines these symbols and indicates the section of the paper in which each is defined. The table also describes the various subscripts to these symbols.

2 Controller Networks
The MANNCON algorithm uses a Proportional-Integral-Derivative (PID) controller (Stephanopoulos 1984), one of the simplest of the traditional feedback controller schemes, as the basis for the construction and initialization of a neural network controller. The basic idea of PID control is that the control action u (a vector) should be proportional to the error, the integral of the error over time, and the temporal derivative of the error. Several tuning parameters determine the contribution of these various components. Figure 1 depicts the resulting network topology based on the PID controller paradigm. The first layer of the network, that from y_sp (desired process output or setpoint) and y_(n-1) (actual process output
Table 1: Definitions of Symbols.

Symbol              Definition                                  Section Introduced
d = [F_d, T_d]      Process disturbances                        Section 2
u = [F_C, F_H]      Process inputs                              Section 2
y = [F(h), T]       Process outputs                             Section 2
e                   Simple error                                Section 2
ε                   Precompensated error                        Section 2
G_I                 Precompensator matrix                       Section 2
F                   Flow rate                                   Figure 1
T                   Temperature                                 Figure 1
h                   Height                                      Figure 2
K_c, τ_I, τ_D       PID tuning parameters                       Section 2
ΔT                  Time between control actions                Section 2
w                   Network weights based on PID controller     Section 2
δ_y                 Error signal at plant output                Section 3
δ_u                 Error signal at plant input                 Section 3

Subscripts
(n)        Value at current step
(n - 1)    Value at previous step
sp         Setpoint
d          Disturbance
C          Cold water stream
H          Hot water stream
of the past time step), calculates the simple error (e). A simple vector difference,

e = y_sp - y

accomplishes this. The second layer, that between e, ε_(n-1), and ε, calculates the actual error to be passed to the PID mechanism. In effect, this layer acts as a steady-state precompensator (Ray 1981), where

ε = G_I e

and produces the current error and the error signals at the past two time steps. This compensator is a constant matrix, G_I, with values such that interactions at steady state between the various control loops are eliminated. The final layer, that between ε and u_(n) (controller output/plant input), calculates the controller action based on the velocity form of the discrete PID controller:

u_C(n) = u_C(n-1) + K_C [ (ε_1(n) - ε_1(n-1)) + (ΔT/τ_IC) ε_1(n) + (τ_DC/ΔT) (ε_1(n) - 2ε_1(n-1) + ε_1(n-2)) ]
Figure 1: MANNCON network showing weights that are initialized using Ziegler-Nichols tuning parameters. See Table 1 for definitions of symbols.

where K_C, τ_IC, and τ_DC are the tuning parameters mentioned above, and ΔT is the discrete time interval between controller actions. This can be rewritten as

u_C(n) = u_C(n-1) + w_C0 ε_1(n) + w_C1 ε_1(n-1) + w_C2 ε_1(n-2)

where w_C0, w_C1, and w_C2 are constants determined by the tuning parameters of the controller for that loop. A similar set of equations and constants (w_H0, w_H1, w_H2) exists for the other controller loop. Figure 2 shows a schematic of the water tank (Ray 1981) that the network controls. This figure also shows the variables that are the controller variables (F_C and F_H), the tank output variables [F(h) and T], and the disturbance variables (F_d and T_d). The controller cannot measure the disturbances, which represent noise in the system. MANNCON initializes the weights of the network in Figure 1 with values that mimic the behavior of a PID controller tuned with Ziegler-Nichols (Z-N) parameters (Stephanopoulos 1984) at a particular operating condition (the midpoint of the ranges of the operating conditions). Using the KBANN approach (see Appendix), it adds weights to the network such that all units in a layer are connected to all units in all subsequent layers, and initializes these weights to small random numbers several orders of magnitude smaller than the weights determined by the PID parameters. We scaled the inputs and the outputs of the network to be in the range [0,1].
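The mapping from tuning parameters to the three final-layer weights can be written out directly. This sketch assumes the textbook velocity form given above; the parameter values in the usage line are hypothetical, not taken from the paper.

```python
def pid_velocity_weights(Kc, tau_I, tau_D, dT):
    """Weights of u(n) = u(n-1) + w0*e(n) + w1*e(n-1) + w2*e(n-2),
    obtained by collecting terms in the velocity-form PID law."""
    w0 = Kc * (1.0 + dT / tau_I + tau_D / dT)
    w1 = -Kc * (1.0 + 2.0 * tau_D / dT)
    w2 = Kc * tau_D / dT
    return w0, w1, w2

# Hypothetical Ziegler-Nichols values for one loop, with dT = 1 sec:
print(pid_velocity_weights(Kc=0.6, tau_I=10.0, tau_D=2.5, dT=1.0))
```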
Figure 2: Stirred mixing tank requiring outflow and temperature control. See Table 1 for definitions of symbols.

Initializing the weights of the network in the manner given above assumes that the activation functions of the units in the network are linear, that is,

o_j,linear = Σ_i w_ji o_i

but the strength of neural networks lies in their having nonlinear (typically sigmoidal) activation functions. For this reason, the MANNCON system initially sets the weights (and the biases of the units) so that the linear response dictated by the PID initialization is approximated by a sigmoid over the output range of the unit. For units that have outputs in the range [-1,1], the activation function becomes

o_j,sigmoid = 2 / (1 + exp(-2.31 Σ_i w_ji o_i)) - 1

which approximates the linear response quite well in the range [-0.6,0.6]. Once MANNCON configures and initializes the weights of the network, it uses a set of training examples and backpropagation to improve the accuracy of the network. The weights initialized with PID information, as well as those initialized with small random numbers, change during backpropagation training.

3 Experimental Details
We compared the performance of three networks that differed in their topology and/or their method of initialization. Table 2 summarizes the
Table 2: Topology and Initialization of Networks.

Network                        Topology                        Weight Initialization
1. Standard neural network     Three-layer (14 hidden units)   Random
2. MANNCON network I           PID topology                    Random
3. MANNCON network II          PID topology                    Z-N tuning
network topology and weight initialization method for each network. In this table, "PID topology" is the network structure shown in Figure 1. "Random" weight initialization sets all weights to small random numbers centered around zero. We trained the networks using backpropagation over a randomly determined schedule of setpoint y_sp and disturbance d changes that did not repeat. The setpoints, which represent the desired output values that the controller is to maintain, are the temperature and outflow of the tank. The disturbances, which represent noise, are the inflow rate and temperature of a disturbance stream. The magnitudes of the setpoints and the disturbances each formed gaussian distributions centered at 0.5. The numbers of training examples between changes in the setpoints and disturbances were exponentially distributed. For example, the original setpoints could be an output flow of 0.7 liters/sec at a temperature of 40°C. After 15 sec (which represents 15 training examples, since time is discretized into one-second slices), the setpoints could change to new values, such as a flow of 0.3 liters/sec at 35°C. The flow rate and the temperature of the disturbance stream also varied in this manner. We used the error at the output of the plant (y in Fig. 1) to determine the network error (at u) by propagating the error backward through the plant (Jordan and Rumelhart 1990). In this method, the error signal at the input to the process is given by

δ_u_i = Σ_j (∂y_j/∂u_i) δ_y_j
where δ_y_j represents the simple error at the output of the water tank and δ_u_i is the error signal at the input of the tank. Since we used a model of the process and not a real tank, we can calculate the partial derivatives from the process model equations. We periodically interrupted training and tested the network over a different (but similarly determined) schedule. Results are averaged over 10 runs for each of the networks. We also compare these networks to a (nonlearning) PID controller that had its tuning parameters determined using a standard design methodology (Stephanopoulos 1984). Using the MIDENT program of the CONSYD package (W. Harmon Ray Research Group 1989), we fit a linear, first-order
model to the outputs of the system when stimulated by random inputs. We then determined an appropriate steady-state precompensator (Ray 1981) and Z-N tuning parameters for a PID controller (Stephanopoulos 1984) using this model. Further details and additional experimentation are reported in Scott (1991).
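The error propagation through the plant described above amounts to multiplying the output error by the transpose of the plant Jacobian. A minimal sketch, with the Jacobian estimated by finite differences on a hypothetical plant model (the real experiment uses analytic derivatives of the tank equations):

```python
import numpy as np

def plant_input_error(plant, u, delta_y, h=1e-5):
    """delta_u_i = sum_j (dy_j/du_i) * delta_y_j. The partial derivatives
    come from a process model u -> y; here they are estimated by finite
    differences, but an analytic model would supply them directly."""
    y0 = plant(u)
    J = np.empty((y0.size, u.size))
    for i in range(u.size):
        du = np.zeros_like(u)
        du[i] = h
        J[:, i] = (plant(u + du) - y0) / h  # column i: dy/du_i
    return J.T @ delta_y

# Toy two-input, two-output mixing model (hot 40 degree / cold 10 degree streams):
toy = lambda u: np.array([u[0] + u[1], (40*u[0] + 10*u[1]) / (u[0] + u[1])])
print(plant_input_error(toy, np.array([0.5, 0.5]), np.array([0.1, -0.2])))
```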
4 Results
Figure 3 compares the performance of the three networks. As can be seen, the MANNCON networks show an increase in correctness over the standard neural network approach. Statistical analysis of the errors using a t test shows that they differ significantly (p = 0.005). Furthermore, while the difference in performance between MANNCON network I and MANNCON network II is not significant, the difference in the variance of the testing error over different runs is significant (p = 0.005). Finally, the MANNCON networks perform significantly better (p = 0.0005) than the nonlearning PID controller tuned at the operating midpoint. The performance of the standard neural network represents the best of several trials with a varying number of hidden units ranging from 2 to 20. A second observation from Figure 3 is that the MANNCON networks learned much more quickly than the standard neural-network approach. The MANNCON networks required significantly fewer training instances to reach a performance level within 5% of their final error rates.
Figure 3: Mean square error of networks on the testset as a function of the number of training instances presented.
Table 3: Comparison of Network Performance.

Method                         Mean square error    Training instances
1. Standard neural network     0.0103 ± 0.0004      25,200 ± 2,260
2. MANNCON network I           0.0090 ± 0.0006      5,000 ± 3,340
3. MANNCON network II          0.0086 ± 0.0001      640 ± 200
4. PID control (Z-N tuning)    0.0131
Table 3 summarizes the final mean error for each of these three network paradigms, as well as the number of training instances required to achieve a performance within 5% of this value.
5 Related Research
A great deal of research in both recurrent networks and in using neural networks for control shares similarities with the approach presented here. The idea of returning the output of the network (and of the system) from the previous training instance to become part of the input for the current training instance is quite common in work pertaining to control (Jordan and Jacobs 1990; Miller et al. 1990). However, this idea also appears in problems pertaining to natural language (Gori et al. 1989) and protein folding (Maclin and Shavlik 1991). In the area of process control, there have been many approaches that use neural networks. Introductions to neural network control are given by Bavarian (1988), Franklin (1990), and Werbos (1990). Feedforward controllers, in which the neural network learns to mimic the inverse of the plant, are discussed by Psaltis et al. (1988), Hosogi (1990), Li and Slotine (1989), and Guez and Bar-Kana (1990). Chen (1990) proposes an adaptive neural network controller where the controller uses a system of two neural networks to model the plant. Hernández and Arkun (1990) propose a Dynamic Matrix Control (DMC) scheme in which the linear model is replaced by a neural network model of the process. Narendra and Parthasarathy (1990) propose a method of indirect adaptive control using neural networks in which one network is used as a controller while the second is the identification model for the process. Bhat and McAvoy (1990) propose an Internal Model Controller (IMC) that utilizes a neural network model of the process and its inverse. Systems that use neural networks in a supervisory position to other controllers have been developed by Kumar and Guez (1990) and Swiniarski (1990). The book by Miller et al. (1990) gives an overview of many of the techniques mentioned here.
6 Future Work
In training the MANNCON-initialized networks, we found the backpropagation algorithm to be sensitive to the values of the learning rate and momentum; there was much less sensitivity in the case of the randomly initialized networks. Small changes in either of these values could cause the network to fail to learn completely. The use of a method involving adaptive training parameters (Battiti 1990), and especially methods in which each weight has its own adaptive learning rate (Jacobs 1988; Minai and Williams 1990), should prove useful. Since not all weights in the network are equal (that is, some are initialized with information while some are not), the latter methods would seem to be particularly applicable. Another question is whether the introduction of extra hidden units into the network would improve the performance by giving the network "room" to learn concepts that are completely outside of the given domain theory. The addition of extra hidden units, as well as the removal of unused or unneeded units, is still an area with much ongoing research. Some "ringing" occurred in some of the trained networks. A future enhancement of this approach would be to create a network architecture that prevented this ringing from occurring, perhaps by limiting the changes in the controller actions to some relatively small values. Another important goal is to apply the approach to other real-world processes. The water tank in this project, while illustrative of the approach, was quite simple. Much more difficult problems (such as those containing significant time delays) exist and should be explored. There are several other controller paradigms that could be used as a basis for network construction and initialization. Several different digital controllers, such as Deadbeat or Dahlin's, could be used in place of the digital PID controller used in this project. DMC and IMC are also candidates for consideration for this approach. Finally, neural networks are generally considered to be "black boxes," in that their inner workings are completely uninterpretable. Since the neural networks in this approach are initialized with information, it may be possible to interpret the weights of the network in some way and extract useful information from the trained network.
7 Conclusions

We have shown that using the MANNCON algorithm to structure and initialize a neural-network controller significantly improves the performance of the trained network in the following ways:

- Improved mean testset accuracy
- Less variability between runs
- Faster rate of learning
The MANNCON algorithm also determines a relevant network topology without resorting to trial-and-error methods. In addition, the algorithm, through initialization of the weights with prior knowledge, gives the backpropagation algorithm an appropriate direction in which to continue learning. Finally, since the units and some of the weights initially have physical interpretations, it seems that the MANNCON networks would be easier to interpret after training than standard, three-layer networks applied to the same task.

Appendix: Overview of the KBANN Algorithm

The KBANN algorithm translates symbolic knowledge into neural networks by defining the topology and connection weights of the network (Towell et al. 1990). It uses knowledge in the form of PROLOG-like clauses to define what is known about a topic. As an example of the KBANN method, consider the simple knowledge base in Figure 4a, which defines membership in category A. Figure 4b represents the hierarchical structure of these rules, where solid lines and dashed lines represent necessary and prohibitory dependencies, respectively. Figure 4c represents the neural network that results from a translation of this knowledge base. Each unit in the neural network corresponds to a consequent or an antecedent in the knowledge base. The solid and dashed lines represent heavily weighted links in the neural network. The dotted lines represent the links added to the network to allow refinement of the knowledge base.
if B, C then A
if G, not(F) then B
if I, J then C

Figure 4: Translation of a knowledge base into a neural network using the KBANN algorithm.
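A sketch of the weight-setting scheme this translation uses: each rule becomes a unit whose heavily weighted links (±ω, with ω = 4 a commonly cited KBANN choice) come from its antecedents, and whose bias places the threshold between "all antecedents satisfied" and "one violated." The low-weight links to the remaining inputs correspond to the dotted lines of Figure 4c. This is a hedged reconstruction of the published scheme, not code from the paper.

```python
import numpy as np

OMEGA = 4.0  # heavy weight for rule-derived links

def rule_to_unit(pos, neg, all_inputs, seed=0):
    """Translate one conjunctive rule into weights and a bias for a
    sigmoid unit over 0/1 inputs. 'pos'/'neg' name the positive/negated
    antecedents; every other input gets a near-zero weight (the dotted
    links) so backpropagation can later refine the rule."""
    rng = np.random.default_rng(seed)
    w = {x: 0.01 * rng.standard_normal() for x in all_inputs}
    for x in pos:
        w[x] = OMEGA
    for x in neg:
        w[x] = -OMEGA
    bias = -(len(pos) - 0.5) * OMEGA  # threshold at (|pos| - 1/2) * OMEGA
    return w, bias

# For the rule "if G, not(F) then B" over inputs F, G, I, J:
wB, biasB = rule_to_unit(pos=["G"], neg=["F"], all_inputs=["F", "G", "I", "J"])
```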
Acknowledgments

G. M. S. was supported under a National Science Foundation Graduate Fellowship. J. W. S. was partially supported by Office of Naval Research Grant N00014-90-J-1941 and National Science Foundation Grant IRI-9002413. W. H. R. was partially supported by National Science Foundation Grant CPT-8715051.
References

Battiti, R. 1990. Optimization methods for back-propagation: Automatic parameter tuning and faster convergence. In International Joint Conference on Neural Networks, Vol. I, pp. 593-596, Washington, DC. Lawrence Erlbaum.
Bavarian, B. 1988. Introduction to neural networks for intelligent control. IEEE Control Syst. Mag. 8, 3-7.
Bhat, N., and McAvoy, T. J. 1990. Use of neural nets for dynamic modeling and control of chemical process systems. Comput. Chem. Eng. 14, 573-583.
Chen, F.-C. 1990. Back-propagation neural networks for nonlinear self-tuning adaptive control. IEEE Control Syst. Mag. 10.
Franklin, J. A. 1990. Historical perspective and state of the art in connectionist learning control. In 28th Conference on Decision and Control, Vol. 2, pp. 1730-1735, Tampa, FL. IEEE Control Systems Society.
Gori, M., Bengio, Y., and DeMori, R. 1989. BPS: A learning algorithm for capturing the dynamic nature of speech. In International Joint Conference on Neural Networks, Vol. II, pp. 417-423. San Diego, CA. IEEE.
Guez, A., and Bar-Kana, I. 1990. Two degree of freedom robot adaptive controller. In American Control Conference, Vol. 3, pp. 3001-3006. San Diego, CA. IEEE.
Hernández, E., and Arkun, Y. 1990. Neural network modeling and an extended DMC algorithm to control nonlinear systems. In American Control Conference, Vol. 3, pp. 2454-2459. San Diego, CA. IEEE.
Hosogi, S. 1990. Manipulator control using layered neural network model with self-organizing mechanism. In International Joint Conference on Neural Networks, Vol. 2, pp. 217-220. Washington, DC. Lawrence Erlbaum.
Jacobs, R. A. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1, 295-307.
Jordan, M. I., and Jacobs, R. A. 1990. Learning to control an unstable system with forward modeling. In Advances in Neural Information Processing Systems, Vol. 2, pp. 325-331. San Mateo, CA. Morgan Kaufmann.
Jordan, M. I., and Rumelhart, D. E. 1990. Forward models: Supervised learning with a distal teacher. Occasional Paper #40, Massachusetts Institute of Technology (to appear in Cog. Sci.).
Kumar, S. S., and Guez, A. 1990. Adaptive pole placement for neurocontrol. In International Joint Conference on Neural Networks, Vol. 2, pp. 397-400. Washington, DC. Lawrence Erlbaum.
Li, W., and Slotine, J.-J. E. 1989. Neural network control of unknown nonlinear systems. In American Control Conference, Vol. 2, pp. 1136-1141. San Diego, CA. IEEE.
Maclin, R., and Shavlik, J. W. 1991. Refining domain theories expressed as finite-state automata. In Eighth International Workshop on Machine Learning. Morgan Kaufmann, San Mateo, CA.
Miller, W. T., Sutton, R. S., and Werbos, P. J., eds. 1990. Neural Networks for Control. The MIT Press, Cambridge, MA.
Minai, A. A., and Williams, R. D. 1990. Acceleration of back-propagation through learning rate and momentum adaptation. In International Joint Conference on Neural Networks, Vol. I, pp. 676-679. Washington, DC. Lawrence Erlbaum.
Narendra, K. S., and Parthasarathy, D. 1990. Identification and control of dynamical systems using neural networks. IEEE Transact. Neural Networks 1(1), 4-27.
Psaltis, D., Sideris, A., and Yamamura, A. A. 1988. A multilayered neural network controller. IEEE Control Syst. Mag. 8, 17-21.
Ray, W. H. 1981. Advanced Process Control. McGraw-Hill, New York.
Scott, G. M. 1991. Refining PID controllers using neural networks. Master's project, University of Wisconsin, Department of Computer Sciences, May.
Stephanopoulos, G. 1984. Chemical Process Control: An Introduction to Theory and Practice. Prentice-Hall, Englewood Cliffs, NJ.
Swiniarski, R. W. 1990. Novel neural network based self-tuning PID controller which uses pattern recognition technique. In American Control Conference, Vol. 3, pp. 3023-3024. San Diego, CA. IEEE.
Towell, G. G., Shavlik, J. W., and Noordewier, M. O. 1990. Refinement of approximate domain theories by knowledge-based neural networks. In Eighth National Conference on Artificial Intelligence, pp. 861-866. AAAI Press, Menlo Park, CA.
W. Harmon Ray Research Group. 1989. CONSYD: Computer-Aided Control System Design. Department of Chemical Engineering, University of Wisconsin, Madison.
Werbos, P. J. 1990. Neural networks for control and system identification. In 28th Conference on Decision and Control, Vol. 1, pp. 260-265. Tampa, FL. IEEE Control Systems Society.
Received 6 September 1991; accepted 28 February 1992.
Communicated by Francoise Fogelman-Soulie
Ultrasound Tomography Imaging of Defects Using Neural Networks

Denis M. Anthony, Evor L. Hines, David A. Hutchins, J. T. Mottram
Department of Engineering, Warwick University, Coventry, England
Simulations of ultrasound tomography demonstrated that artificial neural networks can solve the inverse problem in ultrasound tomography. A highly simplified model of ultrasound propagation was constructed, taking no account of refraction or diffraction, and using only longitudinal wave time of flight (TOF). TOF data were used as the network inputs, and the target outputs were the expected pixel maps, showing defects (gray scale coded) according to the velocity of the wave in the defect. The effects of varying resolution and defect velocity were explored. It was found that defects could be imaged using the time of flight of ultrasonic rays.

1 Introduction
One of the main techniques for use in the nondestructive testing (NDT) of materials is ultrasound. Defects in a material may be detected from a change in the time of flight (TOF) through the material between two fixed points, from amplitude changes in the ultrasonic waves, or by examining the transmitted waveforms. Many measurements need to be taken to scan a material, and this can be time-consuming and costly. Ultrasound tomography, where several measurements are taken simultaneously, may be used to reconstruct an image of the test material. Algorithms exist to reconstruct an image, but there are difficulties to overcome, in that the ultrasonic waves, unlike X-rays, may not be assumed to travel in straight lines, and are affected by diffraction and refraction. Artificial neural networks (ANNs) may be used to solve problems without specifying an algorithm. In one common technique, backpropagation, inputs and target outputs are presented to the network, and the network forms connection strengths between inputs, through "hidden units," through to outputs, such that the outputs attempt to match the targets, in an iterative period of "training." The method is essentially an error gradient descent technique.
ANNs have been used to recognize digits scanned by ultrasonic "eyes" (Watanabe and Yoneyama 1990); digits the network was not trained on were reconstructed with some success. ANNs have determined the radius of a cylindrical object, given the real and imaginary pressures from an array of 16 transducers surrounding the object (Conrath et al. 1989). ANNs may also be used to process data; for example, in medical ultrasound, delay noise has been reduced in simulations where an ANN preprocessed echo delays prior to beamforming (Nikoonahad and Liu 1990). Other NDT applications using ANNs include the inversion of eddy currents to give flaw size (Mann et al. 1989). ANNs have also been used in other tomographic modalities, e.g., laser scattering tomography (Gonda et al. 1989). The aim of this paper is to report work that has been performed using simulation experiments to solve the inverse problem in ultrasound imaging. The final target is to produce tomographic maps of anisotropic composite materials containing defects that may be the result of service load conditions (e.g., with carbon fiber reinforced plastics). These are particularly difficult to image using ultrasound, as the orientation of the fibers in the resin matrix affects the wave propagation, and the fibers act as channels for the waves, thus adding to the above-mentioned diffraction and refraction effects. ANNs might be successful in solving this mapping problem. Prior to attempting the more difficult anisotropic case, it was decided to first determine whether isotropic media could be imaged. Problems in this type of material would be expected to be less severe than in the anisotropic case, and the more difficult problem would be unlikely to be solved unless the simpler case was shown to be soluble. Simulations were undertaken as a prelude to experimental data acquisition, to allow tighter control and for speed of development.

2 Tomography
Computer aided tomography (CAT) is a method by which a series of views of an object may be used to form an image. A typical CAT application is X-ray imaging. In conventional X-ray imaging the rays cover a plane, while in CAT a series of scans is produced, each with a thickness or height of 1 cm or less. The series of views is repeated at different angles. While X-ray is the best known CAT technique, other modalities have been imaged in this way, including nuclear scintigram images, microwaves, electron microscopy, and ultrasound. Reviews of tomography may be found in, for example, Scudder (1978) and Mueller et al. (1979). In principle, if the projection data are represented by a matrix X, and the absorption (assuming X-ray tomography) by Y, then there is a matrix A that gives the relation between the projection data and the absorption:

X = AY    (2.1)
and to reconstruct an image one needs to find the inverse A⁻¹ or, if it does not exist, a pseudoinverse A*. However, the sizes of the matrices are typically far too large to allow matrix inversion to be used as a reconstruction technique, and some other method needs to be found. Common methods include the iterative algebraic reconstruction technique (ART), sketched below, and Fourier transform techniques; ART is discussed further in Section 5. Certain methods specific to particular applications also exist (e.g., diffraction methods in ultrasound). ANNs have been used (Obellianne et al. 1985) in simulated tomography reconstruction, where simulated projection data were processed in a conventional reconstruction algorithm. An ANN employing backpropagation was then used to improve the image by reducing the effects of noise.
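As a point of reference for the network approach, here is a minimal ART (Kaczmarz-style) sketch in the X = AY notation above; the relaxation factor and iteration count are illustrative assumptions.

```python
import numpy as np

def art(A, x, n_iter=50, lam=0.5):
    """Iterative algebraic reconstruction: for each ray equation
    a_i . y = x_i, project the current image estimate y onto its
    hyperplane, with relaxation factor lam. A has one row of cell
    path lengths per ray; x holds the measured projections."""
    y = np.zeros(A.shape[1])
    for _ in range(n_iter):
        for a_i, x_i in zip(A, x):
            nrm2 = a_i @ a_i
            if nrm2 > 0.0:
                y += lam * (x_i - a_i @ y) / nrm2 * a_i
    return y
```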
A thin sheet specimen of square cross section was created, which was assumed to have constant in-plane wave velocity in all directions ke., isotropic). A defect lying in the plane may in principle be of any size, however, a finite imaging resolution will limit the representation or display of the defect. The resolution will be (say) N by N. It was assumed that if the defect lies in any of the N 2 cells, the cell will be designated as having the defect velocity, and if a cell contains no part of a defect it will be designated as background material wave velocity. Thus where a defect lies across the boundary of two cells, both cells will be assumed to have the defect wave velocity. Figure 1 illustrates the scanning method that was adopted for the simulations. An ultrasonic transmitter was assumed to move around the outside of the pixel area, and the ultrasonic signal to be received at all other receiver locations not on the same edge as the transmitter. (The transducers around the specimen may act as transmitters or receivers.) In subsequent discussions the actual defect dimensions will be referred to as the ”ideal” image. The image that results from a given output image resolution will be called the ”target” image. The ANN will be given inputs that will be the TOF of ultrasound rays, and will be trained to give the associated target pixel values. 4 Statistics on Ray Paths
4 Statistics on Ray Paths

If the ANN is to be able to determine the location and size of a defect, there must be a difference between the inputs (TOF data) with a defect and those with no defect. If no rays pass through the defect, the TOF is unaffected in the above model, and the ANN will not detect the defect, let alone be able to locate it. For each defect in a randomly selected sample of 500 single defects, the paths were examined to determine whether any paths went through the defect, and how many defects had not a single ray path passing through them.
Figure 1: Transducer setup for tomography. Conventional setup using parallel beams at angle θ_i (left). Fan-ray tomogram as used for the ANN simulation (right). The material is assumed to be split into N × N cells, within each of which the wave velocity is assumed constant; here 100 cells are shown. The ultrasonic beam is sent from a transmitter (closed circles around the perimeter of the square material) and received by transducers (open circles) along the ray paths indicated by arrows.

There is a relationship between the number of transducers and the size of objects that may be detected (see Fig. 1). If transducers are arranged 4 × 4 along the edges, and each is connected to all 4 on each of the other 3 edges, then 192 paths are created (16 × 3 × 4; i.e., 16 transducer positions, each with 12 active receivers). Over the 500 defects, with 192 potential paths each, 26.8% of paths went through a defect. Of the 500 input patterns, only 22 had no path at all. Of the defects with no path, 12 had zero dimensions (the randomly allocated height or width could be zero), and of the 10 remaining defects, the largest had an area of 0.32 of a pixel. Thus only very small defects are likely to be completely missed by all ray paths, and as Figure 2 shows, most defects are crossed by many paths.
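Statistics of this kind can be gathered with a short Monte Carlo sketch. This is only an illustration under assumed conditions: the defect size range and the sampling-based intersection test are simplifications, not the authors' procedure.

```python
import itertools, random

def transducers():
    # 4 equally spaced positions per edge of the unit square, tagged by edge
    ts = [i / 4 + 1 / 8 for i in range(4)]
    pts = []
    for t in ts:
        pts += [(0, (t, 0.0)), (1, (t, 1.0)), (2, (0.0, t)), (3, (1.0, t))]
    return pts

def crosses(p, q, rect, steps=200):
    # Approximate straight-ray/rectangle intersection by sampling the segment
    x0, y0, x1, y1 = rect
    return any(x0 <= p[0] + (q[0] - p[0]) * i / steps <= x1 and
               y0 <= p[1] + (q[1] - p[1]) * i / steps <= y1
               for i in range(steps + 1))

random.seed(0)
pts = transducers()
missed = 0
for _ in range(500):
    cx, cy = random.random(), random.random()
    w, h = random.random() * 0.2, random.random() * 0.2  # assumed size range
    rect = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    # rays exist only between transducers on different edges
    hit = any(crosses(p, q, rect)
              for (e1, p), (e2, q) in itertools.combinations(pts, 2)
              if e1 != e2)
    missed += not hit
print(f"defects with no ray path: {missed} / 500")
```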
5 Time of Flight Calculations

Initially a very simple model is proposed. The ultrasound waves are assumed to travel in straight lines, with constant velocity in the medium and a lower constant velocity in the defects. No account is taken of diffraction, refraction, reflection, or the effects of anisotropy; only longitudinal waves are assumed, and no account is taken of shear waves, etc. Consider a specimen surrounded by a medium of constant
propagation velocity.

Figure 2: (a) Histogram of the ratio of defect path to total path for defects with a ray path through them; most defect ray paths were short in comparison to the total path. (b) Number of paths passing through defects, showing that the majority of defects have many ray paths passing through them.

The specimen can be split into cells within which a constant velocity may be assumed. The simplistic assumption is that the TOF may be computed by taking a straight line between the transmitter and any of the 12 receivers, and working out the distance traveled in each cell (Mueller et al. 1979). The TOF for ray j at angle of incidence θ_i, which is the measured value, is given by

T_ij = (D - Σ_{k=1}^{N} l_k^{(ij)})/C_m + Σ_{k=1}^{N} l_k^{(ij)}/C_k    (5.1)

where D is the total distance between transmitter and receiver, l_k^{(ij)} is the length of the kth cell traversed by the jth ray at orientation i, C_m is the
velocity of wave propagation in the medium surrounding the specimen, C_k the (constant) wave velocity in the kth cell, and N the number of cells. If one denotes n_k = (1/C_k) - (1/C_m), one has a parameter to be reconstructed (n_k) subject to

Σ_{k=1}^{N} l_k^{(ij)} n_k = ε_ij    (5.2)

where ε_ij = T_ij - (D/C_m). This forms a set of M equations, if M rays are considered, in N unknowns, which may be solved using algebraic reconstruction techniques (ART), convolution methods, Fourier analysis, or diffraction methods (Mueller et al. 1979). In this study a solution to the equations will be attempted using ANNs: the T_ij will be given as inputs and the required image as targets to the ANN. In principle, a linear network can solve this problem, since the set of equations is linear. Nonlinear nets will nevertheless be employed, since in practice nonlinearities may be encountered: the TOF calculations have built-in assumptions that make the linear equations only an approximation to reality. The nonlinear net therefore needs to be tested on the linear simulation, because it will be used in studies based on experimental data. The distances are calculated using simple geometry, the transducer locations and the rectangular defect coordinates being known. In a reconstruction, each transducer would in turn be the transmitter, and the remaining transducers on the other three sides would be receivers. The net will be required to solve the inverse problem of equation 5.2, i.e., to recover the n_k from the measured ε_ij.
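The linear system of equation 5.2 can be assembled explicitly for straight rays. The following is a minimal sketch under stated assumptions: cell traversal lengths are approximated by sampling along the ray, only one transmitter position is shown, and a least-squares solve merely stands in for ART.

```python
import numpy as np

N = 4  # N x N cells on the unit square

def ray_lengths(p, q, steps=2000):
    # Approximate l_k: length of the p->q ray inside each cell, by sampling
    p, q = np.asarray(p, float), np.asarray(q, float)
    seg = np.linalg.norm(q - p)
    l = np.zeros(N * N)
    for i in range(steps):
        x, y = p + (q - p) * (i + 0.5) / steps
        cx, cy = min(int(x * N), N - 1), min(int(y * N), N - 1)
        l[cy * N + cx] += seg / steps
    return l

# One transmitter on the bottom edge, receivers on the top edge (illustrative)
tx = (0.375, 0.0)
receivers = [(r / 4 + 1 / 8, 1.0) for r in range(4)]

n_true = np.zeros(N * N)
n_true[5] = 0.3                 # one hypothetical slow cell
L = np.array([ray_lengths(tx, rx) for rx in receivers])
eps = L @ n_true                # eq. 5.2: eps_ij = sum_k l_k n_k

# With rays from every transmitter position the system becomes solvable;
# here, with only 4 rays, lstsq returns the minimum-norm solution
n_est, *_ = np.linalg.lstsq(L, eps, rcond=None)
```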
6 Experimental Procedures

Given an array of transducers, the effects of varying the resolution of the final image and the ratio of defect wave velocity to background material wave velocity were explored. The imaging of rectangular single defects of arbitrary dimensions was attempted. A set of 16 transducers was assumed to be spaced equally around a material of square cross section. TOF paths were computed from every transmitter to every receiver not on the same edge, giving 192 values. There was redundancy in these data, as each path appeared twice owing to the symmetry of the problem; but in real experiments, especially in anisotropic media, these TOFs would not be identical, due to noise and other effects, and it was therefore decided to include all values. A net using all inputs is potentially more liable to fall into uninteresting one-to-one mappings (i.e., to act as a look-up table) than one using half the inputs, and the net therefore needs to be tested for this potential problem if all inputs are later to be used in an experimental study. These
were fed into an input layer of 192 units, which fed a hidden layer of 48 units; the hidden layer was connected to an output layer that gave pixel values for a reconstructed image. The figure of 48 hidden units is somewhat arbitrary; unfortunately there is little theoretical justification for using any particular architecture. Too many units in the network will give slow convergence and may not generalize well; too few units will not allow a solution to the problem. The architecture used in this study gave convergence, and the nets showed similar performance on test data as on training data, so the nets did generalize. The nets also converged in a reasonable time. In these simulations a fan-beam geometry is assumed (see Fig. 1). The equivalent setup in parallel-beam tomography would have K, the number of projections, given by

K = 3N_d/4    (6.1)
where N_d is the total number of receivers, as each transmitter sends rays to the receivers on all other three sides. It is known that in such a system, to avoid aliasing (Platzer 1981), the number of projections must satisfy

K > πn/2    (6.2)

where n is the number of pixels along one side of the reconstructed image.
Substituting equation 6.1 into equation 6.2, the minimum number of receivers that will avoid aliasing (for the n = 4 case used initially here) is given by
N_d > 8π/3    (6.4)
In Obellianne et al. (1989) it was found that recognition of simulated "lungs" dropped to very low levels when the ratio of receivers to projections fell below the minimum specified by equation 6.2, though the error level of the output image could be reduced substantially by the network. The figure of 12 receivers used in this study should thus not suffer from the problem of aliasing. In all simulations error backpropagation was employed (Rumelhart et al. 1986). The network parameters of learning rate (η) and momentum (α) need to be given values. The lower the value of η, the more likely convergence is to be achieved, but very low values may slow convergence unnecessarily. An α close to unity speeds convergence, and Rumelhart et al. (1986) suggest that a high α of about 0.9 with a low η is superior to a higher η with no momentum term. An η of 0.001 and an α of 0.9 were used, and these values were found to allow convergence in all cases.
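The backpropagation weight update with these parameters can be written compactly. A minimal sketch, with a random vector standing in for the true error gradient:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, eta=0.001, alpha=0.9):
    """One backpropagation weight update with momentum:
    v <- alpha * v - eta * dE/dw;  w <- w + v."""
    velocity = alpha * velocity - eta * grad
    return w + velocity, velocity

# toy usage: a random gradient stands in for dE/dw from backpropagation
w = np.zeros(5)
v = np.zeros(5)
g = np.random.default_rng(1).normal(size=5)
w, v = sgd_momentum_step(w, g, v)
```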
6.1 Defect Types. Rectangular defects of arbitrary dimensions were randomly created for the training set. The resolution of the target image was set, and if a pixel contained any part of a defect, that pixel was allocated to the defect; otherwise it was allocated to the background material. Pixels were given values according to the wave velocity of defect or background. Images were thus overrepresented in size, as defects were rounded up in size for the target pixel map, and a gray scale was produced whereby white indicates the highest velocity and black zero velocity. Initially 4 by 4 resolution images were created using a defect velocity 10% that of the background velocity. So large a reduction in velocity due to the presence of a defect may not be realistic, but a large difference in velocities was found to make network learning easier. As Figure 3 shows, the ANN-restored image locates the defect with some success.

6.2 Increasing Resolution. Increasing the resolution of the output pixel map to 10 by 10 gives better restored images than the 4 by 4 case (see Fig. 3). (Note: the number of receivers remains 4 by 4.) Defects whose length-to-width ratio is high are not detected at either resolution. This is likely because such a defect affects the TOF between transmitter and receiver very little, and the ANN has insufficient variability in its inputs to reconstruct the defect. The ability of a network to converge is often shown by plotting total sum square error (tsse) against the number of epochs of training; the tsse is defined as
tsse = Σ_{p=1}^{N_p} Σ_{i=1}^{N_o} (t_i^p - o_i^p)^2    (6.5)
where N_p is the number of data patterns, N_o is the number of outputs, and t_i^p and o_i^p are the target and output values, respectively, for the pth pattern and ith output. Figure 4 shows the tsse per pixel of networks with various resolutions, illustrating the trend toward lower error as the resolution (number of pixels in the reconstructed image) increases.

6.3 Repeatability. To test whether the ANNs were giving consistent and repeatable results, several runs were made using different initial random weights. As the ANNs were found to converge within two to five epochs (see Fig. 4), the nets were trained for 10 epochs each. The nets showed very similar behavior on repetition. The tsse in every case converged within 2-3 epochs, and was similar for training and test data (i.e., data the net had not been trained on). The absolute values of the weights from input to hidden and from hidden to output nodes slowly increased with time.
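As a concrete check of equation 6.5, a minimal sketch with assumed array shapes:

```python
import numpy as np

def tsse(targets, outputs):
    """Total sum square error of eq. 6.5; both arrays have shape
    (n_patterns, n_outputs)."""
    return float(np.sum((targets - outputs) ** 2))

t = np.array([[1.0, 0.0], [0.0, 1.0]])
o = np.array([[0.9, 0.2], [0.1, 0.8]])
print(tsse(t, o))  # 0.1
```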
Figure 3: Target (left), restored (center), and ideal (right) images of randomly created (single) rectangles. The ideal images show the actual dimensions of the simulated defect. The target image is created using a given pixel resolution (top 2 images 4 × 4, bottom 3 images 10 × 10), so the defect is rounded up in size; this is the image the ANN is given as a target during training. After training, the ANN gives the center image when presented with the TOF data.
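The target-map construction of Section 6.1 (any pixel touched by the defect is rounded up to a full defect pixel) can be sketched as follows; the velocity values and the unit-square coordinate convention are assumptions for illustration.

```python
import numpy as np

def target_image(rect, n=4, v_defect=0.1, v_background=1.0):
    """Rasterize a rectangular defect (x0, y0, x1, y1) on the unit square
    into an n x n velocity map; any pixel overlapping the defect is
    rounded up to a full defect pixel."""
    x0, y0, x1, y1 = rect
    img = np.full((n, n), v_background)
    for r in range(n):
        for c in range(n):
            # pixel (r, c) covers [c/n, (c+1)/n] x [r/n, (r+1)/n]
            if x0 < (c + 1) / n and x1 > c / n and \
               y0 < (r + 1) / n and y1 > r / n:
                img[r, c] = v_defect
    return img

print(target_image((0.30, 0.30, 0.45, 0.60)))
```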
6.4 Amendments to the Backpropagation Algorithm. To improve the images, two changes were made to the training algorithm. The majority bias problem occurs when most of the outputs of the net are set to the same value most of the time, and the net can reduce the tsse quickly by fixing the outputs at this value (i.e., the network gets "stuck" at these outputs). As Chiu et al. (1990) have shown, the use of thresholding may improve images where the majority bias problem is apparent. In the case of small defects, most of the outputs are set to the background material, and this may partly explain the poor performance of the network in detecting small defects. Thresholding was applied to the network, whereby any output that was within 10% of its target contributed no error to the backpropagation algorithm. This would be expected to allow the network to concentrate on the smaller number of defect pixel outputs. However, performance was not improved in 10 repeated runs using this technique, where the criterion of good performance was the receiver operating characteristic curve (ROC; see Section 6.5). To speed convergence of nets, dynamic parameter tuning may be employed, as in Vogl et al. (1988). In this technique the learning rate and momentum are altered according to whether the tsse is decreasing or increasing. Where the tsse decreases from one epoch to the next, η is increased by a scalar factor (we used 1.1); where the tsse increases, α is set to zero and η is reduced by a scalar factor (we used 0.7). In 10 repeated runs little advantage was found using this method in terms of speed of convergence or the performance of the net. In every case η increased for the first few epochs (typically 2-5) and then decreased.

Figure 4: tsse per pixel against number of epochs of training. Top graph: single defect at various resolutions (2 by 2, 3 by 3, 4 by 4, and 10 by 10 pixels); as the resolution increases, the tsse drops. Lower graph: tsse per pixel for 4 by 4 resolution against epochs of training, single defect, for two defect velocities; the tsse for a defect velocity 50% that of the background material is lower than that for 10%, although subjectively the image with the lower tsse was not improved. The bottom figures show, in order: target and output for defect velocity 10% of background, and target and output for 50% defect velocity.

6.5 Ratio of Defect to Background Wave Velocity. Reducing the difference between background and defect velocities decreases the tsse for single defects (Fig. 4). However, looking at the reconstructed images, the ANN does not perform any better, and subjectively appears to be worse (see Fig. 4). It may be that the ANN reduces the tsse by setting the output pixels to a value midway between defect and background. The different slopes of the two cases indicate that they may be learning different mappings. To test the relative performance of the nets, receiver operating characteristic (ROC) curves were constructed. This technique is well known in medical informatics; for a description see Todd-Pokropek (1983) and Swets (1979). Essentially, a graph is plotted of the true positive rate against the false positive rate for a series of confidence levels. The technique has been applied to ANN analysis (Meistrell 1990; Anthony 1991). An area significantly above 0.5 indicates that the network is not giving random classifications. The area under the ROC curve for 10 repeated runs was calculated for defect velocities of 50% and of 10% that of the background. Using the t test of means, there was no significant difference between the two sets (p > 0.1). To test the effect of defect velocity further, a training set was constructed with defect velocity 90% that of the background. There was a significant difference between this set and, separately, both of the other two sets (10 and 50%, p < 0.0001) (see Table 1). For 10 versus 50% the variances were not significantly different. For both 10 and 50% versus 90% defect velocity, there was a difference in the variances, and this was taken into account in the t test using Satterthwaite's approximation for degrees of freedom (SAS 1988); the SAS package was used to perform the tests.

Table 1: Area under ROC curve for various defect velocities.

Defect/background velocity    Mean area    Standard deviation
0.1                           0.639        0.009
0.5                           0.634        0.009
0.9                           0.562        0.018
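Areas of the kind reported in Table 1 can be computed without plotting, via the rank-sum identity between the area under the ROC curve and the probability that a defect score exceeds a background score. A minimal sketch, with hypothetical scores standing in for network outputs:

```python
import numpy as np

def roc_area(defect_scores, background_scores):
    """Area under the ROC curve via the Mann-Whitney identity:
    AUC = P(defect score > background score) + 0.5 * P(tie)."""
    d = np.asarray(defect_scores, float)
    b = np.asarray(background_scores, float)
    greater = (d[:, None] > b[None, :]).sum()
    ties = (d[:, None] == b[None, :]).sum()
    return (greater + 0.5 * ties) / (d.size * b.size)

# hypothetical outputs for defect pixels vs. background pixels
print(roc_area([0.8, 0.6, 0.7], [0.2, 0.5, 0.6]))  # about 0.94
```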
7 Conclusion

A simplistic ultrasound simulation has been constructed. An ANN solved the inverse problem; i.e., given the TOF data it was able to construct a pixel image. The conclusions that may be drawn from this study are as follows:

- The network behavior was repeatable.

- As the resolution of single arbitrary-dimensioned rectangular defects increases, the tsse per pixel decreases. This may be due to a reduction in the rounding up of defects for the targets.

- The defect velocity is critical. If the defect velocity is close to the background velocity, the defect recognition rate is reduced. This may be important when analyzing experimental data: if the data show small TOF differences between defect and nondefect paths, one may need to preprocess the data to increase the dynamic range.

- The use of thresholding errors from the output layer did not solve the majority bias problem.

- Dynamic learning did not improve convergence in these simulations.
ANN techniques may provide alternative methods of reconstructing tomographic images. It may be that conventional tomographic reconstruction will continue to be used for isotropic materials, and that ANNs will be used for anisotropic materials. Another possibility is a marriage of the two techniques, in which the conventional method produces an output that is then improved by a neural net. Future work planned by the authors will explore alternative network designs on the simulation data [e.g., neural trees (Sankar and Mammone 1991)]. The ability of ANNs to detect small defects using simulation data has been tested, and will shortly be published (Anthony et al. 1992). An
experimental tomographic setup has been constructed, which will allow data to be collected on specimens with known defects. The data will be analyzed to determine the most appropriate transformation for input to the neural net. Initial study indicates that the ratio of Fourier transform amplitudes, or the area under the Fourier transform, may allow a single number from each ultrasonic wave to be used; these parameters appear to be affected when a defect lies in the path of the ultrasonic wave. If experimental data are reconstructed with some success, the trained neural network could be implemented in hardware to form an imaging system.

Acknowledgments

This project is funded by the Science and Engineering Research Council
(S.E.R.C.), U.K.

References

Anthony, D. M. 1991. The use of artificial neural networks in classifying lung scintigrams. Ph.D. thesis, University of Warwick.
Anthony, D. M., Hines, E. L., Hutchins, D., and Mottram, J. T. 1992. Simulated tomography ultrasound imaging of defects. In Series in Neural Networks. Springer, Berlin.
Chiu, W. C., Anthony, D. M., Hines, E. L., Forno, C., Hunt, R., and Oldfield, S. 1990. Selection of the optimal MLP net for photogrammetric target processing. IASTED Conf. Artificial Intelligence Appl. Neural Networks, pp. 180-183.
Conrath, B. C., Daft, C. M. W., and O'Brien, W. D. 1989. Applications of neural networks to ultrasound tomography. Ultrasonics Symposium, 1007-1010.
Gonda, T., Kakiuchi, H., and Moriya, K. 1989. In situ observation of internal structures in growing ice crystals by laser scattering tomography. J. Crystal Growth 102, 179-184.
Mann, J. M., Schmerr, L. W., and Moulder, J. C. 1989. Neural network inversion of uniform-field eddy current data. Materials Evaluation, Jan., 34-39.
Meistrell, M. L. 1990. Evaluation of neural network performance by receiver operating characteristic (ROC) analysis: Examples from the biotechnology domain. In Computer Methods and Programs in Biomedicine, pp. 73-80. Elsevier, Amsterdam.
Mueller, R. K., Kaveh, M., and Wade, G. 1979. Reconstructive tomography and applications to ultrasonics. IEEE Proc. 67(4), 567-587.
Nikoonahad, M., and Liu, D. C. 1990. Medical ultrasound imaging using neural networks. Electronics Lett. 26, 545-546.
Obellianne, C., Fogelman-Soulie, F., and Galibourg, G. 1989. Connectionist models for image processing. In From Pixels to Features, COST23 Workshop, J. C. Simon, ed., pp. 185-196. Elsevier, North Holland.
Platzer, H. 1981. Optical image processing. Proc. 2nd Scandinavian Conf. Image Analysis, pp. 128-139.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Sankar, A., and Mammone, R. J. 1991. Optimal pruning of neural tree networks for improved generalization. Proc. 4th International Joint Conference on Neural Networks, Seattle. IEEE.
SAS Users' Guide: Statistics. 1988. Version 5, p. 798.
Scudder, H. J. 1978. Introduction to computer aided tomography. IEEE Proc. 66(6), 628-637.
Swets, J. A. 1979. ROC analysis applied to the evaluation of medical imaging techniques. Invest. Radiol. 14(2), 109-121.
Todd-Pokropek, A. E. 1983. The comparison of a black and white and a color display: An example of the use of receiver operating characteristic curves. IEEE Transact. Med. Imaging MI-2, 19-23.
Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., and Alkon, D. L. 1988. Accelerating the convergence of the back-propagation method. Biol. Cybern. 59, 257-263.
Watanabe, S., and Yoneyama, M. 1990. Ultrasonic robot eyes using neural networks. IEEE Transact. Ultrason. Ferroelect. Freq. Control 37(3), 141-147.

Received 18 June 1991; accepted 1 December 1991.
Communicated by Richard Lippmann
Improving the Accuracy of an Artificial Neural Network Using Multiple Differently Trained Networks

William G. Baxt
Department of Emergency Medicine and Medicine, University of California, San Diego Medical Center, San Diego, CA 92103-8676 USA

When either the detection rate (sensitivity) or the false alarm rate (specificity) is optimized in an artificial neural network trained to identify myocardial infarction, the increase in the accuracy of one always comes at the expense of the accuracy of the other. To overcome this loss, two networks that were separately trained on populations of patients with different likelihoods of myocardial infarction were used in concert. One network was trained on clinical pattern sets derived from patients who had a low likelihood of myocardial infarction, while the other was trained on pattern sets derived from patients with a high likelihood of myocardial infarction. Unknown patterns were analyzed by both networks. If the output generated by the network trained on the low-risk patients was below an empirically set threshold, this output was chosen as the diagnostic output. If the output was above that threshold, the output of the network trained on the high-risk patients was used as the diagnostic output. The dual network correctly identified 39 of the 40 patients who had sustained a myocardial infarction and 301 of 306 patients who did not have a myocardial infarction, for a detection rate (sensitivity) of 97.50% and a false alarm rate (1 - specificity) of 1.63%. A parallel control experiment using a single network but identical training information correctly identified 39 of 40 patients who had sustained a myocardial infarction and 287 of 306 patients who had not sustained a myocardial infarction (p = 0.003).

1 Introduction
Artificial neural networks have been shown to be a powerful pattern recognition paradigm (Widrow and Hoff 1960; Rumelhart et al. 1986; McClelland and Rumelhart 1988; Weigend et al. 1990). It has recently been demonstrated that artificial neural networks can be applied to the analysis of clinical data (Hudson et al. 1988; Smith et al. 1988; Saito and Nakano 1988; Kaufman et al. 1990; Hiraiwa et al. 1990; Cios et al. 1990; Marconi et al. 1989; Eberhart et al. 1991; Mulsant and Servan-Schreiber 1988; Bounds et al. 1990; Yoon et al. 1989). Both retrospective and prospective studies
of the application of this technology to the diagnosis of acute myocardial infarction have revealed that the network can perform substantially more accurately than physicians (Baxt 1991a,b) or other electronic data processing paradigms (Goldman et al. 1988; Pozen et al. 1984). The performance of an artificial neural network is highly dependent on the composition of the data on which it is trained. The ability of a network to identify a pattern is directly related to the representation of that pattern in the data used to train the network. In settings in which networks are trained to make categorical decisions, this appears to translate into the known reciprocal relationship between detection and false alarm rates. It was observed that if a network trained to recognize the presence of myocardial infarction was trained on a pattern set derived from a patient population in which the likelihood of myocardial infarction was high and in which most patients appeared to have sustained a myocardial infarction, the network performed with a high detection rate but a less than optimized false alarm rate. Similarly, when the network was trained on a population in which the likelihood of myocardial infarction was low and in which most patients did not appear to have sustained a myocardial infarction, the network performed with a low false alarm rate but a less than optimized detection rate. A number of strategies were developed to try to improve both detection rate and false alarm rate simultaneously using a single predesigned pattern set. Training sets with varied numbers of patients who had and had not sustained a myocardial infarction, as well as varied numbers of patients who on presentation appeared to have sustained and not sustained a myocardial infarction, were utilized to train the network. Training parameters were also relaxed in order to allow the network greater generalization. These strategies all failed to improve accuracy. Because it had been observed that networks could be trained to optimize one end of the categorical spectrum, it was reasoned that it might be possible to design an algorithm that could utilize two networks in concert, each trained to optimize either detection rate or false alarm rate. To this end, two networks were trained separately, one on a low-risk population of patients and the other on a high-risk population, and then used simultaneously to test unknown patterns. The following reports the results accrued from this strategy.

2 Methods
The architecture of the artificial neural network used in this study is illustrated in Figure 1. It consisted of 20 input units, two layers of 10 internal or hidden units, and 1 output unit (Baxt 1991a,b). The artificial neural network simulator for this project was written specifically for this study in C and run on a UNIX workstation at 1.5 Mflops. The training algorithm used was standard backpropagation (Widrow and Hoff 1960;
Rumelhart et al. 1986; McClelland and Rumelhart 1988) with the Stornetta and Huberman (1987) modification.

Figure 1: Neural network architecture (a 20 × 10 × 10 × 1 backpropagation network: input units, two layers of hidden units, and an output unit).

The network utilized in this study used inputs representing the encoded presenting complaints, past history, and physical and electrocardiographic findings of adult patients presenting to the emergency department with anterior chest pain. These clinical variables are listed in Table 1. The network output was trained to represent the presence or absence of acute myocardial infarction (0 = absence; 1 = presence of myocardial infarction). The network training process consisted of the retrospective selection of a large number of patients who had presented with anterior chest pain and in whom the presence or absence of acute myocardial infarction was known. Training consisted of the repeated sequential presentation of the training set to the network until the error in the output stopped decreasing. Overtraining was prevented by testing the network on a subgroup of patterns to which it had never been exposed; when the error made on this set stopped decreasing, training was considered optimized. Two groups of patients were utilized in the study. The first, or low-risk (LR), group consisted of 346 patients who presented to the emergency department and were felt to have a low risk for the presence of myocardial infarction. This group was used both to train and to test. The second, or high-risk (HR), group consisted of 350 patients who had been admitted to the coronary intensive care unit to rule out the presence of myocardial infarction. All of these patients were felt to have a high likelihood of myocardial infarction when initially evaluated. This group was used only to train. The breakdown of these groups is summarized in Table 2.
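The stopping rule just described (halt when the error on a held-out pattern subgroup stops decreasing) can be sketched as follows; train_epoch and val_error are hypothetical stand-ins for the simulator's routines, not the author's code.

```python
def train_until_optimized(train_epoch, val_error, max_epochs=1000):
    """Validation-based stopping: train until the error on a held-out
    pattern subgroup stops decreasing (guards against overtraining)."""
    best = float("inf")
    for epoch in range(1, max_epochs + 1):
        train_epoch()          # one presentation of the training set
        err = val_error()      # error on patterns never trained on
        if err >= best:        # error stopped decreasing: optimized
            return epoch
        best = err
    return max_epochs

# toy usage: a fake validation-error sequence that bottoms out at epoch 5
errs = iter([0.9, 0.5, 0.3, 0.2, 0.15, 0.16, 0.2])
print(train_until_optimized(lambda: None, lambda: next(errs)))  # 6
```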
Table 1: Input Variables.

History: Age^a, Sex, Left anterior location of pain, Nausea and vomiting, Diaphoresis, Syncope, Shortness of breath, Palpitations, Response to nitroglycerin.
Past history: Past AMI, Angina, Diabetes, Hypertension.
Examination: Jugular venous distension, Rales.
Electrocardiogram: 2-mm ST elevation, 1-mm ST elevation, ST depression, T wave inversion, Significant ischemic change.

^a Analog coded.
Table 2: Patient Groups.

Low Likelihood Group               LRA group    LRB group
Nonmyocardial infarction           153          153
Myocardial infarction              20           20
Weight file derived from training  LRA          LRB

High Likelihood Group
Nonmyocardial infarction           230
Myocardial infarction              120
Weight file derived from training  HR

The training parameters of learning rate and momentum were set at 0.05 and 0.9, respectively. Five pattern sets were constructed and utilized to train the networks. Sets #1-2: the 346 low-risk patients were divided into two groups (LRA and LRB), each with half of the patients who did not sustain a myocardial infarction and half of the patients who did
sustain a myocardial infarction. Set #3: the 350 high-risk patients were used to construct one high-risk pattern set (HR). In order to demonstrate that any improvement in accuracy was due to the two-network strategy and not to the information on which the networks were trained, two additional pattern sets were constructed that contained all the information present in both the low- and high-risk patient groups. Sets #4-5: each of the low-risk pattern sets was combined with the high-risk training set (LRA + HR and LRB + HR) and used to train one network each. Twenty percent of the patients who did and who did not sustain a myocardial infarction were removed from each of the five sets and used as a training validation set. Training was considered optimized when the total number of errors made on the validation set stopped decreasing, using an output of 0.5 as the discriminator between the presence and absence of myocardial infarction (≥ 0.5 = myocardial infarction; < 0.5 = no myocardial infarction). The dual-trained networks were tested on the two low-risk pattern sets (LRA and LRB). This was accomplished programmatically by using both networks simultaneously: one network trained on one of the low-risk pattern sets (LRA or LRB patterns) and the other trained on the high-risk pattern set (HR patterns). The dual network was always tested on the pattern set to which it had not been exposed (i.e., LRA patterns tested on the dual network trained on the LRB pattern set). Two diagnostic outputs were generated. The output from the network trained on the low-risk pattern set was chosen as the diagnostic output unless it was greater than 0.05, in which case the output from the second network, trained on the high-risk pattern set, was used as the diagnostic output. This number was chosen by empirical trials using the training set with cut-off values between 1 and 0.001. An output of 0.5 was again utilized to distinguish between the presence and absence of a myocardial infarction. The entire process is outlined in Figure 2. Each of the low-risk pattern sets was also tested on a single network trained on the two combined low- and high-risk pattern sets (LRA + HR, LRB + HR); each low-risk pattern set was tested on the network trained on the combined set formed by the union of the high-risk pattern set with the low-risk pattern set to which it had not been exposed.
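The dual-network decision rule described above (hand off to the high-risk network when the low-risk output exceeds 0.05, then apply the 0.5 diagnostic cut-off) is compact enough to sketch directly; the network outputs here are hypothetical.

```python
def dual_network_diagnosis(low_risk_out, high_risk_out,
                           handoff=0.05, cutoff=0.5):
    """Use the low-risk net's output unless it exceeds the hand-off
    threshold, in which case defer to the high-risk net; 0.5 separates
    infarction (1) from no infarction (0)."""
    out = low_risk_out if low_risk_out <= handoff else high_risk_out
    return int(out >= cutoff)

# toy usage with hypothetical network outputs for one patient
print(dual_network_diagnosis(0.02, 0.7))  # 0: low-risk net is decisive
print(dual_network_diagnosis(0.30, 0.7))  # 1: deferred to high-risk net
```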
3 Results

The low-risk patients were tested because this group represented the typical patient group presenting to the emergency department in whom the diagnosis of acute myocardial infarction would be entertained. The two low-risk pattern sets (LRA and LRB patterns) were tested on the dual network trained on the low-risk pattern set to which it had not been exposed. The dual network correctly identified 39 of 40 patients who had sustained a myocardial infarction [detection rate 97.5% (95% C.I. = 95-100%)] and 301 of 306 patients who had not sustained a myocardial infarction [false alarm rate 1.63% (95% C.I. = 0.2-3.0%)]. The full process of designation and elimination is illustrated in Table 3.

Figure 2: Separated training and testing algorithm used to develop the dual network system.

Table 3: Network Testing.

                           Dual network    Single network
Myocardial infarction
  Correct                  39              39
  Incorrect                1               1
  Detection rate           97.50%          97.50%
Nonmyocardial infarction
  Correct                  301             287
  Incorrect                5               19
  False alarm rate         1.66%           6.21%
p = 0.003

The single networks trained on the control combined pattern sets were tested on the low-risk pattern sets to which they had not been exposed. The two single networks identified 39 of 40 patients who had sustained a myocardial infarction [detection rate 97.5% (95% C.I. = 95-100%)] and 287 of 306 patients who had not sustained a myocardial infarction [false alarm rate 6.2% (95% C.I. = 3.5-7.6%)]. The chi-square analysis of the difference in false alarm rates of the single and dual networks was carried out by constructing 2 × 2 contingency tables from the summarized results of the experimental and control testing illustrated in Table 3. The McNemar symmetry chi-square analysis of this table had a p value of 0.001, and the Yates corrected chi-square analysis had a p value of 0.003.

4 Discussion
The purpose of this study was to develop a strategy that could take advantage of the fact that artificial neural networks trained to make categorical decisions can individually be trained to optimize either detection rate or false alarm rate. By utilizing dual networks, one network was used to identify those patients who had not sustained a myocardial infarction, while the other was used to identify those who had. The approach used signal strength analysis to determine which network should make the final diagnostic decision. The signal, or output, of the network trained on the low-risk patterns was measured for each pattern the network processed. If the signal level rose above a set threshold, this was taken as an indication that the patient may have sustained a myocardial infarction. At this point, the analysis was shifted to the network trained on the high-risk patterns and thus trained to identify the presence of myocardial infarction. Although the empirically chosen signal cut-off of 5.0 × 10^-2 appears low, it should be pointed out that the mean output generated by the network trained on the low-risk pattern set for patients who had not sustained a myocardial infarction was 1.0 × 10^-6. Variation of this threshold has a significant impact on the accuracy of the network, and care had to be taken to choose the most optimized separation point empirically. By separating the analysis in this manner, the negative reciprocal effects on detection rate and false alarm rate accrued by improving one or the other seem to have been obviated. Although single network processing is inherently highly accurate (Baxt 1991a,b), simultaneous optimization of both detection rate and false alarm rate could be achieved only by using the two separate weight sets in concert. This strategy yielded a higher detection rate and lower false alarm rate than the use of weight sets derived from any single optimized patient pattern set. That the separated training and dual analysis imparted higher accuracy to network function is supported by the result obtained from the control pattern sets: the same training information was available to the single network as to the dual network, yet the single network approach was not able to perform as accurately as the dual network. Separated training appears to impart a higher degree of accuracy to the performance of the artificial neural network in this setting; however, it should be pointed out that the improvement in accuracy is not profound, and in other settings this small improvement may not be significant. In the setting of the diagnosis of a disease such as myocardial infarction, which is a disease of both low incidence and major penalty for misdiagnosis, the optimization gained by the use of separated training is highly desirable. Misdiagnosis resulting from a less than optimized detection rate can lead to significant morbidity or even mortality. Because myocardial infarction is a disease of low incidence, a less than optimized false alarm rate will lead to the unwarranted admission of large numbers of patients to the coronary care unit and hospital (Goldman et al. 1988). The major drawback of this study is that the strategy has been tested only retrospectively; it must be tested prospectively on a large number of patients before it can be fully validated statistically. It may also be possible to duplicate these results using a single network and single training pattern set, since this approach was not fully exhausted. Furthermore, the methodology was tested only in this specific setting and may not be transportable to other diagnostic settings or applications. However, if the methodology is validated, it may also be applicable to other applications of artificial neural network technologies.

Acknowledgments
I thank Doctors Hal White and David Zipser for their help with the technical aspects of this study and Kathleen James for her help in the preparation of this manuscript.

References

Baxt, W. G. 1991a. Use of an artificial neural network for data analysis in clinical decision-making: The diagnosis of acute coronary occlusion. Neural Comp. 2, 480-489.
Baxt, W. G. 1991b. Use of an artificial neural network for the diagnosis of myocardial infarction. Ann. Intern. Med. 115, 843-848.
Bounds, D. G., Lloyd, P. J., and Mathew, B. G. 1990. A comparison of neural network and other pattern recognition approaches to the diagnosis of low back disorders. Neural Networks 3, 583-591.
Cios, K. J., Chen, K., and Langenderfer, R. A. 1990. Use of neural networks in detecting cardiac diseases from echocardiographic images. IEEE Eng. Med. Biol. Mag. 9, 58-60.
Eberhart, R. C., Dobbins, R. W., and Hutton, L. V. 1991. Neural network paradigm comparisons for appendicitis diagnosis. Proceedings of the Fourth Annual IEEE Symposium on Computer-Based Medical Systems, 298-304.
Goldman, L., Cook, E. F., Brand, D. A., Lee, T. H., Rouan, G. W., Weisberg, M. C., Acampora, D., Stasiulewicz, C., Walshon, J., Terranova, G., Gottlieb, L., Kobernick, M., Goldstein-Wayne, B., Copen, D., Daley, K., Brandt, A. A., Jones, D., Mellors, J., and Jakubowski, R. 1988. A computer protocol to predict myocardial infarction in emergency department patients with chest pain. N. Engl. J. Med. 318, 797-803.
Hiraiwa, A., Shimohara, K., and Tokunaga, Y. 1990. EEG topography recognition by neural networks. IEEE Eng. Med. Biol. Mag. 9, 39-42.
Hudson, D. L., Cohen, M. E., and Anderson, M. F. 1988. Determination of testing efficacy in carcinoma of the lung using a neural network model. Symposium on Computer Applications in Medical Care 1988 Proceedings: 12th Annual Symposium, Washington, DC 12, 251-255.
Kaufman, J. J., Chiabrera, A., Hatem, M., et al. 1990. A neural network approach for bone fracture healing assessment. IEEE Eng. Med. Biol. Mag. 9, 23-30.
Marconi, L., Scalia, F., Ridella, S., Arrigo, P., Mansi, C., and Mela, G. S. 1989. An application of back propagation to medical diagnosis. Proceedings of the International Joint Conference on Neural Networks, Washington, DC 2, 577.
McClelland, J. L., and Rumelhart, D. E. 1988. Training hidden units. In Explorations in Parallel Distributed Processing, J. L. McClelland and D. E. Rumelhart, eds., pp. 121-160. The MIT Press, Cambridge, MA.
Mulsant, G. H., and Servan-Schreiber, E. 1988. A connectionist approach to the diagnosis of dementia. Symposium on Computer Applications in Medical Care 1988 Proceedings: 12th Annual Symposium, Washington, DC 12, 245-250.
Pozen, M. W., D'Agostino, R. B., Selker, H. P., Sytkowski, P. A., and Hood, W. B., Jr. 1984. A predictive instrument to improve coronary-care-unit admission practice in acute ischemic heart disease: A prospective multicenter clinical trial. N. Engl. J. Med. 310, 1273-1278.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, eds., pp. 318-364. The MIT Press, Cambridge, MA.
Saito, K., and Nakano, R. 1988. Medical diagnostic expert system based on PDP model. Proceedings of the International Joint Conference on Neural Networks, San Diego 2, 255-262.
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., and Johannes, R. S. 1988. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Symposium on Computer Applications in Medical Care 1988 Proceedings: 12th Annual Symposium, Washington, DC 12, 261-265.
Stornetta, W. S., and Huberman, B. A. 1987. An improved three-layer, back-propagation algorithm. In Proceedings of the IEEE First International Conference on Neural Networks, M. Caudill and C. Butler, eds. SOS Printing, San Diego, CA.
Weigend, A. S., Huberman, B. A., and Rumelhart, D. E. 1990. Predicting the future: A connectionist approach. Stanford PDP Research Group, April.
Widrow, B., and Hoff, M. E. 1960. Adaptive switching circuits. Institute of Radio Engineers Western Electronic Show and Convention, Convention Record, Part 4, 96-104.
Yoon, Y. O., Brobst, R. W., Bergstresser, P. R., and Peterson, L. L. 1989. A desktop neural network for dermatology diagnosis. J. Neural Network Comp. 43-52.
Received 22 October 1991; accepted 6 February 1992.
ARTICLE

Communicated by Richard Lippmann
Rule-Based Neural Networks for Classification and Probability Estimation

Rodney M. Goodman, Charles M. Higgins, John W. Miller
Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125 USA
In this paper we propose a network architecture that combines a rule-based approach with that of the neural network paradigm. Our primary motivation for this is to ensure that the knowledge embodied in the network is explicitly encoded in the form of understandable rules. This enables the network's decision to be understood, and provides an audit trail of how that decision was arrived at. We utilize an information theoretic approach to learning a model of the domain knowledge from examples. This model takes the form of a set of probabilistic conjunctive rules between discrete input evidence variables and output class variables. These rules are then mapped onto the weights and nodes of a feedforward neural network, resulting in a directly specified architecture. The network acts as a parallel Bayesian classifier, but more importantly, can also output posterior probability estimates of the class variables. Empirical tests on a number of data sets show that the rule-based classifier performs comparably with standard neural network classifiers, while possessing unique advantages in terms of knowledge representation and probability estimation.

1 Introduction
The rule-based knowledge representation paradigm is well established as a powerful model for higher level cognitive processes (Newell and Simon 1972; Chomsky 1957), whereas the connectionist paradigm seems very well suited to modeling lower level perceptual processes. In particular, rule-based expert systems have proven to be a successful software methodology for automating complex decision-making tasks. Primary advantages of this approach include the facility for explicit knowledge representation in the form of rules and objects, and the ability of a rule-based
system's reasoning to be understood by humans. However, current rule-based systems are fundamentally restricted in speed of execution, and hence in their applicability to real-time systems, because of the serial computations performed in present inference processing schemes. In addition, current rule-based systems are brittle in their ability to deal with the uncertainties inherent in real-world information, and lack any ability to generalize to novel problems. Neural network paradigms, on the other hand, are typically quite adept at modeling problems that occur in pattern recognition, visual perception, and control applications. This ability is due (at least in part) to their inherent robustness in the presence of noise, the lack of which plagues the implementation of rule-based systems in practice. In addition, neural networks are inherently parallel, and special-purpose parallel neural network hardware implementations promise quantum leaps in processing speed, suitable for real-time systems. However, neural networks, as presently implemented, are poor at explaining their reasoning in human-understandable terms because they embed domain knowledge in the implicit form of weights and hidden nodes. The network is thus very much a "black-box" solution, whose structure and reasoning are relatively inaccessible to higher level reasoning or control processes, such as the human user. In many areas of expertise, such as medical, legal, or life-critical domains, it is an absolute requirement that an autonomous reasoning agent be able to explain its decisions to a higher level authority such as a judge. We are therefore led to ask whether it is possible to amalgamate the rule-based and connectionist approaches into a hybrid scheme, combining the better aspects of both while eliminating the drawbacks peculiar to each. A natural common ground on which to combine these approaches is that of probability. We show that by referencing both rule-based systems and neural networks to the common normative frame of probability, a novel and practical architecture emerges. In this paper we propose a hybrid rule-based connectionist approach that overcomes many of the problems outlined above. Our ultimate goal is the automatic learning of rule-based expert systems that can perform inference in parallel when implemented on neural network architectures. For the purposes of this paper, however, we concentrate on the problem of classification and posterior probability estimation, implemented on rule-based feedforward neural nets. We show how probabilistic rules can be used as a natural method for describing the high-order correlation information in discrete (or categorical) data, and how the hidden units of a feedforward net can easily implement such rules. Furthermore, we show how information theory and minimum description length theory can be used to learn only the most important of these rules, thus directly specifying the network architecture in terms of hidden units and connectivity. Finally, we show that output probabilities may be estimated using a parallel Bayesian approach, which is a natural extension of a first-order Bayes classifier. The architecture proposed in this paper is
therefore novel for a number of reasons. First, it avoids iterative network training processes (such as backpropagation) by directly specifying network weights in terms of probability estimates derived from the example data. Second, the hidden nodes of the network are automatically learned from the data, without having to specify the number of such nodes. This approach leads to the advantage that network parameters are directly interpretable in terms of rules with associated weights of evidence between the nodes. Third, given that it is usually necessary to assume some form of conditional independence among the input variables in order to render the probability estimation problem tractable, the proposed classification scheme is novel in that it uses data-dependent conditional independence assumptions only to the extent justified by the data. Networks that learn from example form the basis of many current connectionist paradigms. The success of the backpropagation (Rumelhart et al. 1986) and related algorithms is that, given a specific architecture in terms of input, hidden, and output nodes, the connection weights between these nodes needed to model the high-order correlations in the example data can be easily learned. Learning the network architecture itself, and generating true output probability estimates, is a considerably more difficult task for current neural network paradigms. It is interesting to note that Uttley (1959) conceived of a network in which all higher order input-output correlations were stored. This network stored a number of probabilities exponential in the number of input variables, but contained the information necessary for calculating the conditional probability of any set of output variables, given any other set of input variables. In principle, this provided a method of calculating output probabilities at the expense of exponentially many of what we would now call hidden units, many of which were redundant in the sense of not contributing to the output information. Networks whose architectures include high-order connections chosen randomly were of course among the very early neural network models (Rosenblatt 1962; Aleksander 1971). At the other extreme, in a previous paper we showed how simple first-order correlations could be used to successfully predict output probabilities (Goodman et al. 1989), provided the data were well specified by such low-order information. Between these extremes lie approaches that make subjective prior judgments about conditional independence to decide which higher order conjunctive probabilities to store, such as the Bayesian networks described by Pearl (1988), Lansner and Ekeberg (1989), and Kononenko (1989). This paper develops in the following way. First, we outline our rule-based network architecture. Second, we describe our methodology for learning a set of probabilistic production rules from example data, using an information theoretic approach. Third, we show how these rules are then mapped onto the nodes and links of a feedforward neural network in such a manner that the network computes posterior class probabilities using a Bayesian formalism. We conclude with a comparative evaluation
of the approach using five data sets, including medical diagnosis and protein structure prediction.

2 A Rule-Based Classifier Architecture
We consider the problem of building a classifier that relates a set of K discrete feature variables (or attributes) comprising the set Y = {Y_1, ..., Y_K} to a discrete class variable X. Each attribute variable Y_l takes values in the alphabet {y_l^1, ..., y_l^{m_l}}, 1 ≤ l ≤ K, where m_l is the cardinality of the lth attribute alphabet. The class variable X takes discrete values from the set {x_1, ..., x_m}, where m is the cardinality of the class. We also assume that we are given an initial labeled training set of N examples, where each example is of the form {Y_1 = y_1, ..., Y_K = y_K, X = x_i}. The supervised learning problem we set ourselves is to learn a classifier that, when presented with future unseen attribute vectors (which may be either partial or complete), will estimate the posterior probability of each class. We may then wish to output either these probabilities, or the class variable with the highest probability, as the decision made by the classifier. Note that we are particularly interested in real data sets in which the classification is often nondeterministic or noisy; that is, there exists class overlap and hence a fundamental ambiguity in the mapping from Y to X. In this case there is no perfect classifier for the problem, and the performance of the classifier as measured by its error rate will be nonzero, bounded below by the optimal Bayes error rate p*. The rule-based architecture we propose takes the form of a three-layer feedforward network, as shown in Figure 1.
Figure 1: Architecture of the rule-based classifier (input attributes, conjunctive rules, output classes).
The input nodes correspond to each possible attribute-value pair in the input attribute space. The hidden layer consists of a set of |R| conjunction detector nodes (or AND gates), one for each rule in the set of rules R. These hidden nodes detect the conjunction of one or more input attribute-value pairs of the form {Y_1 = y_1, ..., Y_l = y_l}. When a conjunction is detected, the rule fires and the node outputs a 1; when the node is not firing, it outputs a 0. The output layer consists of one node for each output class. The action of a rule firing contributes an activation from the hidden rule node into one or more output class nodes. The contribution into the ith output node from rule r_j is given by the link weight w_{ji}, and represents the weight of evidence for the conclusion given the occurrence of the left-hand side conjunction of attribute values. Each rule node together with its output link can therefore be considered to be implementing an lth-order conjunctive rule of the form
IF {Y_1 = y_1, ..., Y_l = y_l} THEN X = x_i WITH STRENGTH w_{ji}
The rule has a conjunction of input attribute-value pairs on its left-hand side (LHS), and a particular class attribute-value pair on its right-hand side (RHS). The weights w_{ji} can be positive or negative, depending on whether the rule supports the truth or the falsity of the RHS conclusion. Each output node accumulates the inputs feeding into it from the rules that have fired and outputs a quantity that is a function of the particular activation function and threshold used in the node. Our design problem is then to implement a set of rules and associated weights, together with a suitable set of output activation functions and thresholds, such that the output of each class node is an estimate of the corresponding class probability.

3 Learning Rules Using Information Theory

We now consider how to learn the set of rules R from the given training data such that the classifier will operate in the desired manner. Clearly we do not want to implement all possible conjunctive rules, as the size of the hidden layer would be exponential in the size of the input attributes. Rather, we require a sufficiently good set of rules that allows the network both to load the training data and to generalize to new data, while having a performance that approaches the optimum Bayes risk. Alternatively, given a fixed resource constraint of |R| allowed hidden units, we should implement the best |R| rules according to some "goodness" criterion. Let us rephrase the previously defined rule in terms of a probabilistic production rule of the form:
IF s THEN x_i with probability p
where p is the conditional probability p(x_i | s), and s represents the particular conjunction of attribute-value pairs found in the LHS of the rule. We wish to have a measure of the utility or goodness of such a rule. In a Hebbian sense such a rule might be considered good if the occurrence of the LHS conjunction of variables is strongly correlated with the RHS. Alternatively, such a rule might be considered good if the transition probability p is near unity. For example, a rule with p = 1 is a deterministic rule in which the occurrence of s implies X = x_i with certainty. However, we will take an information theoretic approach to this problem, and consider that the goodness of such a rule can be measured by the average number of bits of information that the occurrence of the LHS s gives about the RHS X = x_i. We have introduced such a measure, called the J-measure (Goodman and Smyth 1989), which can be defined as

J(X; s) = p(s) * sum_j p(x_j | s) log[ p(x_j | s) / p(x_j) ]
This measure possesses a variety of desirable properties as a rule information measure, not the least of which is the fact that it is unique as a nonnegative measure of the information that s gives about X (Blachman 1968). As can be seen, the J-measure is the product of two terms. The first is p(s), the probability that the LHS will occur. This term can be viewed as a preference for generality or simplicity in our rules; that is, the left-hand side must occur relatively often for a rule to be deemed useful. The other term is the cross-entropy of X and X given s, and as such is a well-founded measure of the goodness of fit between our a posteriori belief about X and our a priori belief (Shore and Johnson 1980). Hence, maximizing the product of the two terms, J(X; s), is equivalent to simultaneously maximizing both the simplicity of the hypothesis s and the goodness of fit between s and a perfect predictor of X. The simplicity of s directly corresponds to the number of attribute-value conjunctions in the rule LHS, that is, the rule order. Low-order rules have fewer LHS conditions, and thus a higher p(s). There is a natural trade-off involved here because, typically, one can easily find high-order rules (less probable s's) that are accurate predictors, whereas one has a preference for more general low-order rules (more probable s's). The J-measure thus provides a principled method not only of ranking the goodness of a set of rules, but also of telling whether a more specialized rule (one with more LHS conditions) is better or worse than a more general rule. This basic trade-off between accuracy and generality (or goodness of fit and simplicity) is a fundamental principle underlying various general theories of inductive inference (Angluin and Smith 1984; Rissanen 1989).
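A sketch of the J-measure computation, reading the cross-entropy term as the Kullback-Leibler distance between the posterior and prior class distributions described above (log base 2 gives bits); this is our own rendering, not code from the paper:

```python
import math

def j_measure(p_s, post, prior):
    """J(X; s) = p(s) * sum_j p(x_j|s) * log2( p(x_j|s) / p(x_j) ).
    p_s   : probability that the rule's LHS conjunction s occurs
    post  : posterior class distribution p(x_j | s)
    prior : prior class distribution p(x_j)
    """
    kl = sum(q * math.log2(q / p) for q, p in zip(post, prior) if q > 0.0)
    return p_s * kl

# A general (high p(s)) rule can beat a more accurate but rarer specialization:
print(j_measure(0.40, [0.80, 0.20], [0.50, 0.50]))  # ~0.111 bits
print(j_measure(0.05, [0.99, 0.01], [0.50, 0.50]))  # ~0.046 bits
```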
We use the J-measure in a search algorithm (Smyth and Goodman 1992) to search the space of all possible rules relating the LHS attributes Y to the RHS class X, and produce a ranked candidate set S of the |S| most informative rules that classify X. The search proceeds in a depth-first manner, starting with a particular LHS conjunction and progressively specializing the rule until bounds indicate that a higher measure cannot be achieved by specializing further. The search is potentially exponential, but in practice is highly constrained by small-sample estimators and information theoretic bounds that heavily penalize searching higher order rules (empirical results demonstrating this effect are given in Smyth and Goodman 1992). In addition, higher order rules that have lower information content than a corresponding lower order (more general) rule can be omitted from the final rule list. From this large candidate set of rules S we next produce the final set of rules R, which will be implemented in the classifier.

4 Rule Pruning Using a Minimum Description Length Model
We have already described how we find an initially large candidate set of rules S that models the data. It is well known, both empirically and from theory, that there is a trade-off between the complexity of the model and the quality of generalization performance. A model that is too simple will not have the representational power to capture the regularities of the environment, whereas a model that is too complicated may well overfit the training data and generalize poorly. When we speak here of generalization we are referring to the system's mean performance in terms of classification accuracy (or a similar function) evaluated over some infinitely large independent test data set. The notion of Occam's razor has been used to promote model parsimony: choose the simplest model that perfectly explains the data. Unfortunately this presupposes that there exists a model under consideration that can explain the data perfectly in this manner. In practical problems this is unlikely to be the case, since there is often an ambiguity in the mapping from attribute space to the class labels. In this stochastic setting a more general version of Occam's razor has been introduced (Rissanen 1984, 1987) under the title of minimum description length (MDL). The MDL principle is simple to state: choose the model that results in the least description length, where the description length is calculated by first sending a message describing the model [the complexity term L(M)], followed by a message encoding the data given the model [the goodness-of-fit term L(D | M)]. Thus we minimize:

L(M) + L(D | M)
MDL can be viewed as entirely equivalent to a Bayes maximum a posteriori (MAP) principle (where one chooses the model that maximizes the joint probability of the data and the model), by virtue of the fact that description lengths are directly related to prior probabilities. We will refer primarily to the description length framework as it is somewhat more intuitive.
In the context of applying MDL to the problem at hand, we seek a pruned rule set R, a subset of S, which possesses near-minimal description length among all possible subsets; finding the optimal solution is clearly intractable in the general case. For a more general discussion of search in MDL contexts see Smyth (1991). The algorithm we propose is a simple greedy procedure that, starting from an initially empty set of rules, continues to add the next-best rule to the current set, and terminates at a local minimum of the description length function. In more detail the algorithm is described as follows:

4.1 MDL Rule Pruning Algorithm.

1. Let R = { }.
2. Find the rule r in S such that when R + r is evaluated on the training data as a classifier, the sum of the goodness of fit and the complexity of r is minimized.

3. Remove rule r from S.

4. If the description length of R + r is greater than the description length of R, then stop.

5. Else let R = R + r and return to step 2.
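The greedy loop can be sketched directly from steps 1-5; description_length stands for the calculation defined below and is a name of our choosing, not the paper's:

```python
def prune_rules(S, data, description_length):
    """Greedy MDL pruning: grow R from the candidate set S until the
    description length of the resulting classifier stops decreasing."""
    R = []
    best = description_length(R, data)
    while S:
        # Step 2: the rule whose addition gives the shortest description.
        r = min(S, key=lambda c: description_length(R + [c], data))
        candidate = description_length(R + [r], data)
        S.remove(r)                      # step 3
        if candidate > best:             # step 4: local minimum reached
            break
        R.append(r)                      # step 5
        best = candidate
    return R
```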
At this point in the discussion we can treat the classifier itself as a "black box" that simply takes a rule set R and a set of unlabeled test data, and produces probability estimates of the class labels. We will describe this "black box" in detail in the next section. Let us first look at the other part of the algorithm, which we have not defined in detail, namely the calculation of description length. Suppose we have N training samples. For the ith sample, 1 <= i <= N, let x_true(i) be an index to the true class, i.e., the training label. Let p[x_true(i)] be the classifier's estimate of this class given the ith attribute vector. Hence, the length in bits to describe the data given the model (the classifier) is

L(D | M) = - sum_{i=1}^{N} log p[x_true(i)]
The complexity term, the length in bits to describe the classifier itself, may be arrived at by a number of arguments. In principle we need to send a message describing, for each rule, its left-hand side component, its right-hand side component, and an estimate of the transition probability of the rule. One of the key factors in proper application of MDL is the precision with which these probabilities are stated. It is a well known general result that very often the optimal precision for model parameters is proportional to 1/sqrt(N), or about (1/2) log N bits per parameter. In practice this term dominates the specification of the rule components as N becomes large. Since these specification terms also depend on the
particular coding scheme (or the prior bias, in Bayesian terminology), we choose to ignore these terms in the optimization or search, and propose that the complexity be simply proportional to the (3/2) log N precision terms. This penalty scheme has been widely used in practice by other authors (Rissanen and Wax 1988; Tenorio and Lee 1990; among others). Hence, for rule set R the complexity is assessed as

L(M) = (3/2) |R| log N
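Combining the goodness-of-fit and complexity terms, a sketch of the description-length calculation in bits might read as follows; classifier_probs is a stand-in of ours for the classification procedure of Section 5:

```python
import math

def description_length(R, data, classifier_probs):
    """L(M) + L(D|M) in bits for rule set R on N labeled examples.
    data            : list of (attribute_vector, true_class_index) pairs
    classifier_probs: function (R, attribute_vector) -> class distribution
    """
    N = len(data)
    # Goodness of fit: code length of the labels under the model,
    # L(D|M) = -sum_i log2 p[x_true(i)].
    fit = -sum(math.log2(max(classifier_probs(R, y)[x], 1e-12))
               for y, x in data)
    # Complexity: (3/2) log2 N precision bits per rule, as in the text.
    complexity = 1.5 * len(R) * math.log2(N) if N > 1 else 0.0
    return complexity + fit
```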
As we shall discuss later in the section on empirical results, this simple pruning algorithm is very effective at discovering parsimonious rule sets that account for the data. Heuristically, for multivalued class problems, we can understand the behavior of the algorithm as initially trying to account for each class by a single accurate rule, and then integrating rules that cover regions of the attribute space with high accuracy. In particular, as we shall discuss in the next section, by evaluating the performance of the classifier on each candidate rule, we can match the rule set to the nature of the classifier (conditional independence in this case). If |R| is the number of rules in the final rule set, then it is straightforward to show that the complexity of the pruning algorithm scales approximately as N|S||R|^2. Typically |R| << |S|, the number of rules in the initial rule set. It is difficult to bound |R| accurately (since it depends on the complexity of the particular classification problem); however, empirical results suggest that it often grows sublinearly with N, perhaps as slowly as log N.

5 Derivation of the Classification Equations
In this section we describe how the network uses the learned rule set to estimate the class probabilities, given a particular set of evidential attribute values. As before we have m classes x_1, ..., x_m and a rule set R. As discussed earlier, each rule r_j in R specifies a particular lth-order left-hand side attribute conjunction s_j, a class x_i, and the transition probability p(x_i | s_j), where we recall that s_j denotes a particular conjunction of input attribute-value terms. For a particular input vector {y_1, ..., y_K}, a certain subset of rules F, contained in R, is said to "fire," i.e., F is the set of rules whose left-hand sides logically evaluate to true, or, in neural terms, the set of hidden nodes that are activated. The problem is simple: given only knowledge of p(x_i | s_j), 1 <= j <= |F|, how can we estimate p(x_i | s_1, ..., s_|F|)? In principle there are strong arguments for using a maximum entropy solution, i.e., viewing the p(x_i | s_j) and the particular input vector {y_1, ..., y_K} as a set of constraints and maximizing the entropy of the joint distribution subject to these constraints (Cheeseman 1983; Miller and Goodman 1990).
However, the direct solution of this nonlinear optimization problem is unattractive from an implementation viewpoint, being both computationally intensive and unnatural to integrate into a system based on explicit knowledge representation. A better approach in this context is to make a particular simplifying assumption, resulting in a maximum entropy solution that can be directly expressed in closed form (in terms of the component rules). This key assumption is that the left-hand side conjunctions are conditionally independent given the class, i.e., for any pair of rules r_j and r_k in R that refer to the same class x_i, we have
p(s_j, s_k | x_i) = p(s_j | x_i) p(s_k | x_i)    (5.1)
As described in the previous section, the rule set R is formed from a large candidate set of rules in a manner such that rules that obey this conditional independence assumption are included and those that violate the assumption are not. Hence, we find a classifier that uses conditional independence only insofar as it can be justified by the training data; this is considerably more robust than making a priori assumptions about independence without any knowledge of the data. Assuming conditional independence of attributes given the class is well motivated, as discussed by Pearl (1988). By Bayes' rule we have that

p(x_i | s_1, ..., s_|F|) = p(s_1, ..., s_|F| | x_i) p(x_i) / p(s_1, ..., s_|F|)
 = [ prod_{j=1}^{|F|} p(s_j | x_i) ] p(x_i) / p(s_1, ..., s_|F|)    [by the conditional independence assumption in (5.1)]
 = [ prod_{j=1}^{|F|} p(s_j) / p(s_1, ..., s_|F|) ] p(x_i) prod_{j=1}^{|F|} [ p(x_i | s_j) / p(x_i) ]    (by Bayes' rule)

Let us define the weights w_ij as

w_ij = log [ p(x_i | s_j) / p(x_i) ]

a bias term for each class as

t_i = log p(x_i)

and an (as yet) undetermined constant

C = log [ prod_{j=1}^{|F|} p(s_j) / p(s_1, ..., s_|F|) ]
Hence, we get that

log p(x_i | s_1, ..., s_|F|) = C + t_i + sum_{j=1}^{|F|} w_ij    (5.2)

Equation 5.2 allows us to estimate the log posterior probability of each output class. From this, the actual probability can be computed, or a classification decision can be made by simply choosing the class with the largest estimate as the output. Equation 5.2 admits a direct and intuitive interpretation of the operation of the classifier. First, we can ignore the unknown constant C because it can be eliminated by the constraint that the sum of the posterior estimates must equal 1, as shown in the next section. Thus, in the absence of any rules firing (|F| = 0), the estimate for each class is given by the bias value t_i, namely the log of the prior probability of the class x_i. Given a set of rules that fires, each rule contributes a "weight of evidence" into its corresponding output class. This weight of evidence w_ij has a direct interpretation as the evidential support for the class provided by the rule: a positive weight implies support for the class being true, while a negative weight implies support for it being false. The w_ij thus provide the user with a direct explanation of how the classification decision was arrived at. Each class estimate is then computed by accumulating the "weights of evidence" incident on each class from the rules that fire, which can be done in a parallel manner. We can relate our weights of evidence to those proposed by Good (1950) and Minsky and Selfridge (1961); namely, our w_ij are what Good termed "relative weights of evidence" for the case of multivalued classes. This weights-of-evidence classifier (which is relatively well known and has appeared in various guises in recent decades) is an intuitively elegant implementation of a linear reasoning scheme, as pros and cons for a particular hypothesis (or class) are tabulated in an additive manner.
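A sketch of this inference step, reusing the Rule/fires representation sketched in Section 2 (our own construction); the normalization of the next section is included so that the unknown constant C drops out:

```python
import math

def classify(rules, priors, example):
    """Estimate p(x_i | fired rules) via equation 5.2.
    rules  : list of Rule objects with .rhs and .weight (w_ij) fields
    priors : prior class probabilities p(x_i), assumed strictly positive
    """
    # Bias terms t_i = log p(x_i); with no rules firing the priors come back.
    u = [math.log(p) for p in priors]
    for r in rules:
        if fires(r, example):           # hidden AND gate outputs 1
            u[r.rhs] += r.weight        # accumulate weight of evidence
    # Exponentiate and normalize; the unknown constant C cancels here.
    z = [math.exp(v) for v in u]
    total = sum(z)
    return [v / total for v in z]
```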
6 Neural Architecture
The classification procedure given by equation 5.2 can now be mapped onto the three-layer feedforward network architecture shown in Figure 2. The input layer contains one node for each input attribute value. The hidden layer contains one node for each rule. These nodes are effectively AND gates that output a 1 if the left-hand side of the rule is satisfied, and a 0 otherwise. The third layer contains a node for each value of the class attribute. Each second-layer node representing rule j is connected to a third-layer node i via the multiplicative weight of evidence w_ij. Also feeding into, and summed by, each third-layer node is the bias value t_i. The sum of activations into this node, given by

u_i = t_i + sum_{j=1}^{|F|} w_ij
Figure 2: Neural network architecture.

is then exponentiated to produce the node output:
0’,=
e ‘ l
= epc
. p(xl 1 s l . . . . , slF1)
The output of the exponentiators is then fed into a normalization layer that constrains each output to satisfy

O_i = O'_i / sum_{k=1}^{m} O'_k
This effectively removes the constant C from each input O'_i = e^{-C} p(x_i | s_1, ..., s_|F|), producing the output probability estimate O_i = p(x_i | s_1, ..., s_|F|) as desired. If required, a winner-take-all stage can be added to decide on the most likely class. It is interesting to note that for the special case of a binary class variable, m = 2, the resulting circuit may be considerably simplified to that shown in Figure 3. In this case the output is a single node that accumulates the weights of evidence for one class value and against the other. The rule weights incident on the output are then

w_j = w_1j - w_2j

and the bias is the log-odds of X,

t = log [ p(x_1) / p(x_2) ]
The exponentiation and normalization steps are combined by noting that for this binary case

O_1 = O'_1 / (O'_1 + O'_2) = 1 / (1 + e^{-(u_1 - u_2)})
Figure 3: Binary class architecture.

resulting in the output node having the well-known sigmoid activation function. If a classification decision is required, the sigmoid is simply replaced with a hard-limiting node that switches at zero input activation, outputting 1 for class x_1 and 0 for class x_2. Without getting into details, it is important to note that the above neural architecture is well suited to VLSI implementation. In particular, the weights for the first layer are binary, and the hidden units are simply AND gates. Analog weight storage is required only for the second-layer weights, and there are typically many fewer of these than there are first-layer weights. Also, exponentiation can easily be performed in VLSI by using MOS transistors in their subthreshold region (Chiueh and Goodman 1990), and the final normalization stages can be performed by variants of the winner-take-all circuit (Lazzaro et al. 1990).
7 Empirical Results
We now compare the performance of the proposed classifier with that of two other classifiers, namely a backpropagation-trained neural network and a first-order Bayes model. It is important to point out that the primary goal of these experiments was to see whether our rule-based classifier yielded comparable performance (in terms of classification accuracy) to standard alternative approaches, rather than demonstrably superior performance. It is well known that most well-founded classifier algorithms will come reasonably close to the optimal Bayes classification rate on most reasonable problems (Weiss and Kapouleas 1989; Lee and Lippmann 1990). Hence the goal of the empirical evaluation is to
test whether the rule-based classifier can achieve similar near-optimal performance over a number of different problems. From this point onward we will refer to the first-order Bayes classifier model as the "first-order classifier," and to the backpropagation-trained feedforward neural network as the "neural network." A first-order classifier is a special case of our rule-based network in which the architecture contains no hidden units, i.e., the model consists of all possible first-order rules, with the weights defined exactly as for the rule-based network described earlier. This model is also known as a "naive Bayes" model (Kononenko 1989) and amounts to assuming that the joint distribution of the class and the attributes can be factored into first-order terms. We chose this model for comparison purposes both because of its simplicity and because it often provides better classification performance than one has any right to expect. The particular neural network design algorithm that we used was the conjugate-gradient scheme of Barnard and Cole (1989). Each network has three layers, the first layer containing a single node for each attribute, and the third layer containing a node for each class. The size of the hidden unit layer was typically chosen to be roughly twice the number of input nodes. One of the current problems with neural network techniques is the arbitrary choices that must be made in terms of architecture selection. If one chooses too few hidden units, the network may have too limited a hypothesis space to learn the required concepts, while with too many it may overfit the training data. Typically, however, for the data sets considered here, we found only minor variations in classification accuracy as long as the number of hidden units was of the same order as the number of input units. We evaluated the performance of the algorithms on five data sets. Two of the data sets are synthetic (LED digits, and a Boolean function), while the other three are real-world data sets (congressional voting, medical diagnosis, and protein secondary structure). The first data set is the well-known LED digits classification problem with 10% noise added. Essentially this consists of a seven-segment LED display, where the seven segments correspond to seven binary features, and the digits "0" through "9" represent 10 classes. The 10% noise consists of reversing each segment from its true value with probability 0.1. This renders the classification problem somewhat nontrivial, since the optimal Bayes classification accuracy can be shown to be about 74% (as opposed to 100% for the noiseless case). We generated a database of size 1000 (with 10 equally likely classes) to use for evaluation. The second data set consists of 435 voting records from a session of the 1984 United States Congress. The attributes correspond to 16 different issues on which the politicians voted, such as aid to the Nicaraguan contras and budget cuts. The class variable is party affiliation, i.e., Democrat or Republican. Recognition accuracy of up to 95% is known to be achievable on this data set using only a single attribute, the physician-fee-freeze attribute. Hence, as suggested by
Table 1: Comparative Performance Results on Three Data Sets.

                        Mean percentage accuracy +/- SD                        Mean rule
Data set     Trivial   First-order      Neural           Rule-based     complexity +/- SD
LED digits   10.0      74.1 +/- 5.3     72.2 +/- 4.96    73.1 +/- 5.03     40 +/- 1.27
Voting       61.4      87.44 +/- 9.66   87.68 +/- 7.06   88.18 +/- 4.41    2.2 +/- 0.40
Boolean      64.7      66.66 +/- 5.41   89.99 +/- 2.33   89.06 +/- 2.44   11.7 +/- 0.48
Buntine (1991) and others, the problem is made more interesting by removing this attribute. On the modified data set, Buntine reports accuracies up to 89% using a variety of decision tree techniques. The third data set is artificially generated with size 640, where there are 6 binary attributes, Y_1, ..., Y_6, and the class is the Boolean function
X = OR[XOR(Y_1, Y_2), AND(Y_3, Y_4), AND(Y_5, Y_6)]
To introduce noise, the class variable X has a 10% random chance of being reversed from its true state. Hence the optimal recognition rate on this problem is 90%. The fourth data set is a real database of breast cancer diagnosis data collected at the University of Wisconsin Hospitals between January 1989 and July 1990. We will describe this data set in more detail later, since the performance of the classifiers was evaluated in an incremental manner, as if they had been run as the data were collected (in chronological order). The fifth data set is a protein secondary structure problem, also described in more detail later. For the first three data sets we use the standard evaluation technique of V-fold cross-validation, where V was chosen to be 10. This means that the LED, voting, and Boolean function data sets were divided into disjoint test sets of size 100, 43, and 64, respectively. The neural network was a three-layer feedforward network with sigmoid activation functions, and 25, 20, and 8 hidden units for the LED, voting, and Boolean function problems, respectively. Both the mean and the standard deviation of the resulting cross-validation estimates are reported in Table 1. In addition we tabulate the mean complexity of the rule-based classifier for each data set, in terms of the mean (over the different training sets) number of weights connected to the output layer (the number of rules in the classifier). One column of the table gives the mean accuracy obtainable for each data set by the trivial strategy of always predicting the most likely class label. Performance on the data is roughly equivalent across the classifiers, except for the first-order model on the Boolean function data; one of the motivations for including this data set was to demonstrate the limitations of the first-order model in capturing such high-order concepts. Hence, we can conclude that the rule-based model achieves roughly comparable classification accuracy to the more usual backpropagation model.
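For reference, the noisy Boolean data set can be regenerated as follows; the sampling details (uniform attribute values, independent label flips) are our reading of the description above:

```python
import random

def boolean_example(rng):
    y = [rng.randint(0, 1) for _ in range(6)]
    # X = OR[XOR(Y1, Y2), AND(Y3, Y4), AND(Y5, Y6)]
    x = (y[0] ^ y[1]) or (y[2] and y[3]) or (y[4] and y[5])
    if rng.random() < 0.10:       # 10% chance the class label is reversed
        x = 1 - x
    return y, int(x)

rng = random.Random(0)
data = [boolean_example(rng) for _ in range(640)]
# With 10% label noise, no classifier can exceed 90% accuracy on average.
```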
The fourth data set considered is the aforementioned medical diagnosis database. A common technique in breast cancer diagnosis is to obtain a fine needle aspirate (FNA) from a patient under examination. Wolberg and Mangasarian (1990) describe the domain in some detail. The FNA sample is evaluated under a microscope by a physician who makes a diagnosis. All patients evaluated as malignant, and some of those labeled as benign, later undergo biopsy, which confirms or disconfirms the original diagnosis; the other patients diagnosed as benign undergo later reexamination to provide a true measurement of their condition. Since biopsy is roughly eight times as costly as the FNA technique, it is important that unnecessary biopsies be kept to a minimum. In addition, Wolberg and Mangasarian report that physicians encounter borderline cases that make diagnosis difficult. The approach taken by Wolberg and Mangasarian was to collect training data in the form of nine subjectively evaluated characteristics of the FNA sample for each patient. These features describe general characteristics of the FNA sample as seen under a microscope, such as uniformity of cell size, marginal adhesion, and mitoses. Ground truth in the form of class labels (benign or malignant) was obtained at a later stage by biopsy or reexamination. A classifier was then designed that takes the physician's description of the FNA sample and produces a diagnosis. In Wolberg and Mangasarian (1990) a successful linear programming technique is introduced for determining the parameters of a neural network classifier for this diagnosis problem. For our evaluation purposes we used the same database, which consists of 535 patient records. As described above there are 9 attributes, each of which takes on a discrete value between 1 and 10. We chose to evaluate classifier performance by training each classifier on the first k x 50 samples (where 1 <= k <= 10) and testing on the remainder. This gives an idea of the performance as a function of training sample size, and is also closer to the manner in which a classifier would be used in practice, since the database of patient records is in chronological order. The results are shown in Figure 4. Clearly, beyond about 150 training samples, all of the classifiers perform equally well. The excellent performance of the first-order classifier, and the fact that near 100% accuracy can be attained, leads one to suspect that the problem is not a difficult one in terms of classifier design. The relatively poor performance of the rule-based classifier for small sample sizes deserves some comment. In effect, the MDL nature of the classifier design algorithm ensures that the model is conservative in its use of parameters when there are few data available. In contrast, both the first-order model and the neural network had a fixed, relatively complex architecture independent of the amount of training data; the neural network used a single hidden layer with 12 hidden units throughout. In theory, for small sample sizes, both of these networks are too complex to be plausible models in a statistical sense. This phenomenon has been observed elsewhere by Cybenko (1990) and Smyth (1991). Nonetheless, in practice, overcomplex models can outperform the more theoretically correct, simpler, models on a particular data set.
Figure 4: Medical database performance.

In particular, we note that our rule-based classifier, when used with the more complex (in terms of more rules) unpruned rule set, also performs comparably to the other techniques. Classification accuracy alone is not, however, the only figure of merit of interest. We are particularly interested in the ability of our scheme to provide an estimate of the classifier's confidence in its decision, by using the output probability estimates provided along with the classification decision. For the rule-based and first-order Bayes networks these probabilities are produced directly. For the backpropagation network we normalize the output activations to produce a probability estimate. Figure 5 shows the mean binary entropy computed using these probabilities, for each classifier's decisions on the medical database, as a function of sample size. Entropy provides a measure of the classifier's uncertainty in its decision, and ranges from 0 (completely certain) to 1 bit (maximally uncertain). In practice this uncertainty estimate can provide a useful confidence indicator to a higher-level decision maker.
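The uncertainty measure plotted in Figure 5 is the binary entropy of the classifier's output probability; a sketch of its computation over the correct (or incorrect) decisions, with our own names throughout:

```python
import math

def binary_entropy(p):
    """H(p) in bits: 0 = completely certain, 1 = maximally uncertain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def mean_decision_entropy(probs, labels, correct=True):
    """Average H over the cases the classifier got right (or wrong).
    probs : estimated probability of the positive class per test case
    labels: true class indicators (0 or 1)
    """
    picked = [binary_entropy(p) for p, x in zip(probs, labels)
              if ((p > 0.5) == bool(x)) == correct]
    return sum(picked) / len(picked) if picked else float("nan")
```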
Figure 5: Medical database classifier uncertainty.

Two cases are shown for each classifier. One case corresponds to the uncertainty when the classifier's decision was correct, and the other corresponds to the uncertainty when the decision was incorrect. Ideally we would like a classifier to have a low uncertainty (near 0) when it makes a correct decision, and a high uncertainty (near 1) when it makes an incorrect decision. Consider first the three curves that indicate incorrect decisions. From Figure 5 we see that the rule-based system performs well in this regard, being near maximal uncertainty when it makes a wrong decision over the entire range of sample sizes. The neural network does not perform as well. It begins with a reasonable degree of uncertainty and then becomes more definite (in its mistakes) as the training size increases. The first-order Bayes classifier is initially quite definite in its mistaken conclusions, but becomes more reasonable as the sample size increases. This effect is likely due to the fact that its model becomes more accurate (in terms of probability estimates) as more data become available. Also shown are the uncertainty curves for correct decisions. The rule-based and backpropagation networks are comparably low, as desired, with the first-order Bayes classifier being even more confident in its decisions. In Table 2 we list the actual rules obtained when the algorithm was trained on the first 400 samples of the data set. This set of 11 rules is the final set obtained by the MDL portion of the algorithm after the
Table 2: Medical Database Rules.

J-Measure  Rule                                                        Strength w_ij
0.297      IF cell size uniformity = 1 AND mitoses = 1 THEN DB(a)          5.9
0.289      IF bare nuclei = 1 AND normal nucleoli = 1 THEN DB              6.2
0.271      IF epithelial cell size = 2 AND bare nuclei = 1 THEN DB         8.0
0.231      IF bare nuclei = 10 THEN DM(b)                                 -4.4
0.145      IF clump thickness = 10 THEN DM                                -5.7
0.111      IF cell size uniformity = 10 THEN DM                           -5.3
0.103      IF normal nucleoli = 10 THEN DM                                -5.2
0.085      IF marginal adhesion = 10 THEN DM                              -4.2
0.057      IF cell size uniformity = 5 THEN DM                            -4.5
0.056      IF epithelial cell size = 10 THEN DM                           -3.8
0.045      IF bland chromatin = 8 THEN DM                                 -4.2

(a) Diagnosis benign. (b) Diagnosis malignant.
rule search procedure had initially found a candidate set of 500 rules. The rules are ranked in order of decreasing average information content. The rules that confirm the benign condition (positive weights) are somewhat more informative than those that conclude the malignant condition (negative weights), primarily because the malignant rules have a lower prior probability of occurrence, i.e., their left-hand side conditions are less likely. Figure 6 shows a diagram of the network that results when the rules are implemented on a neural architecture. Note that there are really only three genuine hidden units (the AND gates), corresponding to the three second-order rules. The first-order rules do not need a hidden unit and effectively correspond to a single weighted link between the input and output layers. The final data set was chosen to test the rule-based approach on a large database. One of the original successes of the neural network classifier model on a large-scale problem was the secondary structure protein prediction problem, as described by Qian and Sejnowski (1988). The objective of this prediction task is to predict the secondary structure of globular proteins from a knowledge of their sequence of amino acids (the primary structure). The secondary structure is comprised of small groups of residues that join together into recognizable local shapes. The secondary structure is classified into one of three types: "helix," "sheet," and "coil," denoted by "h," "e," and "-," respectively. For our experiment we used the same training and testing data as used in Qian and
Figure 6: Medical database network.

Sejnowski's paper, which consists of 18,105 training residues and 3520 testing residues. Each "example" in the database consists of a window of 13 contiguous amino acids: six preceding and six following a particular central amino acid. A total of 20 different amino acids appear in the data, each one denoted by an alphabetic character (A, C, D, etc.). The objective is to classify the central amino acid into one of the three secondary structure classes. Prior to Qian and Sejnowski's work, the best results obtained on the protein data set were in the mid-50% accuracy range. The work of Qian and Sejnowski showed that accuracies in the low 60% range were obtainable using neural network techniques: 62.7% for a single network, and 64.3% when correlations between adjacent elements in the sequence were taken into account using a secondary cascaded network. Subsequent studies by other authors using network models achieved similar accuracies. Indeed, Stolorz et al. (1991) recently showed that a first-order Bayes classifier can achieve 61.1% accuracy using a window size of 17, indicating that there is limited predictive information in the attributes beyond first order.
Table 3: Protein Database Rules.

J-Measure  Rule                                      Strength w_ij
0.016      IF primary+1 = P THEN secondary = -           0.467
0.011      IF primary+1 = P THEN secondary = h           1.542
0.010      IF primary+0 = P THEN secondary = -           0.394
0.009      IF primary+0 = G THEN secondary = -           0.296
0.007      IF primary+2 = P THEN secondary = -           0.340
0.006      IF primary+0 = V THEN secondary = e           0.309
0.006      IF primary+2 = P THEN secondary = h          -0.837
0.006      IF primary+0 = G THEN secondary = h          -0.428
0.006      IF primary-1 = P THEN secondary = -           0.313
0.006      IF primary+0 = V THEN secondary = -          -0.348
0.005      IF primary+0 = I THEN secondary = e           0.360
0.004      IF primary+0 = L THEN secondary = -          -0.285
0.004      IF primary+1 = V THEN secondary = e           0.237
0.004      IF primary-1 = G THEN secondary = -           0.212
0.004      IF primary+0 = P THEN secondary = e           1.261
0.004      IF primary+3 = P THEN secondary = h          -0.594
0.004      IF primary+1 = L THEN secondary = -          -0.265
0.004      IF primary+0 = I THEN secondary = -          -0.374
0.004      IF primary+0 = A THEN secondary = h           0.594
0.004      IF primary-1 = G THEN secondary = h           0.273
We ran our algorithm to find the best first-order model, i.e., using only first-order rules on a window size of 13. The final model contained 194 rules and correctly predicted 61.7% of the test samples. In Table 3 we list the 20 most informative rules from the final pruned network model. It is interesting to note that the best rules tend to involve the central amino acid (primary+0) and the ones nearby at positions primary+1, primary-1, etc., rather than those farther away from the center. The rules also tend to be grouped into triplets with the same left-hand side. These triplets tend to consist of a positive rule for the most probable class (-, with a prior of 0.545), and two negative rules for the other two classes. Again, we note that the point of this experiment was not necessarily to obtain better results than have previously been reported, but to demonstrate that results comparable to other "black-box" techniques can be obtained on a large-scale discrete prediction problem, while achieving useful explainability due to the explicit rule-based model. The experimental results confirm that the rule-based classifier is very competitive in terms of classification accuracy when compared with alternative approaches. The results (particularly those for the cancer diagnosis problem) also clearly demonstrate the unique ability of this approach to produce a hybrid rule-based neural network, wherein units and weights
possess a clear semantic interpretation to the external observer. In domains such as medical diagnosis, such a feature makes the likelihood of user acceptance much higher than would be the case with a "black box" algorithm.

8 Conclusions
A novel hybrid rule-based connectionist classifier architecture has been proposed. The architecture of the classifier is derived directly from the example data by an efficient information-theoretic search technique. The classification performance of the hybrid scheme on discrete data has been shown to be comparable with that of conventional neural network classifiers, and the resulting network exhibits an explicit knowledge representation in the form of human-readable rules.

Acknowledgments

This work is supported in part by Pacific Bell, in part by the Army Research Office under Contract DAAL03-89-K-0126, and in part by DARPA under Contract AFOSR-90-0199. Part of this research was carried out by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. The authors thank David Aha of U.C. Irvine for providing the voting data set, and Olvi Mangasarian of the University of Wisconsin for providing the medical diagnosis data.

References

Aleksander, I. 1971. Microcircuit Learning Computers. Mills and Boon, London.
Angluin, D., and Smith, C. 1984. Inductive inference: Theory and methods. ACM Comput. Surveys 15(3), 237-270.
Barnard, E., and Cole, R. 1989. A neural net training program based on conjugate-gradient optimization. Oregon Graduate Center Tech. Rep. No. CSE 89-014, Oregon.
Blachman, N. M. 1968. The amount of information that y gives about X. IEEE Trans. Inform. Theory IT-14(1), 27-31.
Buntine, W. 1991. A theory of learning classification rules. Ph.D. Thesis, University of Technology, Sydney.
Cheeseman, P. 1983. A method of computing generalized Bayesian probability values for expert systems. Proceedings of the Eighth International Joint Conference on Artificial Intelligence (Karlsruhe, West Germany), pp. 198-202.
Chiueh, T., and Goodman, R. M. 1990. VLSI implementation of a high-capacity neural network associative memory. In Advances in Neural Information Processing Systems 2, D. Touretzky, ed., pp. 793-800. Morgan Kaufmann, San Mateo, CA.
Chomsky, A. N. 1957. Syntactic Structures. Mouton, The Hague.
Cybenko, G. 1990. Complexity theory of neural networks and classification problems. In Neural Networks, L. Almeida, ed., pp. 26-45. Springer Lecture Notes in Computer Science.
Good, I. J. 1950. Probability and the Weighing of Evidence. Charles Griffin, London.
Goodman, R. M., and Smyth, P. 1989. The induction of probabilistic rule sets: the ITRULE algorithm. Proceedings of the 1989 International Workshop on Machine Learning, pp. 129-132. Morgan Kaufmann, San Mateo, CA.
Goodman, R. M., Miller, J. W., and Smyth, P. 1989. An information theoretic approach to rule-based connectionist expert systems. In Advances in Neural Information Processing Systems 1, D. Touretzky, ed., pp. 256-263. Morgan Kaufmann, San Mateo, CA.
Kononenko, I. 1989. Bayesian neural networks. Biol. Cybernet. 61, 361-370.
Lansner, A., and Ekeberg, O. 1989. A one-layer feedback artificial neural network with a Bayesian learning rule. Int. J. Neural Networks 1(1), 77-87.
Lazzaro, J., Ryckebusch, S., Mahowald, M. A., and Mead, C. A. 1990. Winner-take-all networks of O(N) complexity. In Advances in Neural Information Processing Systems 2, D. Touretzky, ed., pp. 703-711. Morgan Kaufmann, San Mateo, CA.
Lee, Y., and Lippmann, R. P. 1990. Practical characteristics of neural network and conventional pattern classifiers on artificial and speech problems. In Advances in Neural Information Processing Systems 2, D. Touretzky, ed., pp. 168-177. Morgan Kaufmann, San Mateo, CA.
Miller, J. W., and Goodman, R. M. 1990. A polynomial time algorithm for finding Bayesian probabilities from marginal constraints. Proceedings of the Sixth Conference on Uncertainty in AI, Cambridge, Massachusetts, July 27-29.
Minsky, M., and Selfridge, O. G. 1961. Learning in random nets. In Information Theory (Fourth London Symposium), C. Cherry, ed., pp. 335-347. Butterworth, London.
Newell, A., and Simon, H. A. 1972. Human Problem Solving. Prentice-Hall, Englewood Cliffs, NJ.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA.
Qian, N., and Sejnowski, T. J. 1988. Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol. 202, 865-884.
Rissanen, J. 1984. Universal coding, information, prediction, and estimation. IEEE Trans. Inform. Theory 30, 629-636.
Rissanen, J. 1987. Stochastic complexity. J. Royal Stat. Soc. B 49(3), 223-239.
Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry. World Scientific, Teaneck, New Jersey.
Rissanen, J., and Wax, M. 1988. Algorithm for constructing tree structured classifiers. U.S. Patent No. 4,719,571.
Rosenblatt, F. 1962. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing 1, D. E. Rumelhart and J. L. McClelland, eds., p. 318. The MIT Press, Cambridge, MA.
Shore, J. E., and Johnson, R. W. 1980. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inform. Theory IT-26(1), 26-37.
Smyth, P. 1991. On stochastic complexity and admissible models for neural network classifiers. In Advances in Neural Information Processing Systems 3, D. Touretzky, R. Lippmann, and J. Moody, eds., pp. 818-824. Morgan Kaufmann, San Mateo, CA.
Smyth, P., and Goodman, R. M. 1992. An information theoretic approach to rule induction from databases. IEEE Trans. Knowledge Data Engineering 4(4), 301-316.
Stolorz, P., Lapedes, A., and Xia, Y. 1991. Predicting protein secondary structure using neural net and statistical methods. Tech. Rep. LA-UR-91-15, Los Alamos National Laboratory, Los Alamos, New Mexico.
Tenorio, M. F., and Lee, W. T. 1990. Self-organizing network for optimum supervised learning. IEEE Trans. Neural Networks 1(1), 100-110.
Uttley, A. M. 1959. The design of conditional probability computers. Inform. Control 2, 1-24.
Weiss, S. M., and Kapouleas, I. 1989. An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. Proceedings of IJCAI 1989, Morgan Kaufmann, San Mateo, CA.
Wolberg, W. H., and Mangasarian, O. L. 1990. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc. Natl. Acad. Sci. U.S.A. 87, 9193-9196.
Received 19 April 1991; accepted 26 March 1992.
ARTICLE
Communicated by Gerald Tesauro
Complex Scheduling with Potts Neural Networks

Lars Gislén, Carsten Peterson, Bo Söderberg
Department of Theoretical Physics, University of Lund, Sölvegatan 14A, S-22362 Lund, Sweden
In a recent paper (Gislén et al. 1989) a convenient encoding and an efficient mean field algorithm for solving scheduling problems using a Potts neural network were developed and numerically explored on simplified and synthetic problems. In this work the approach is extended to realistic applications, both with respect to problem complexity and size. This extension requires, among other things, the interaction of Potts neurons with different numbers of components. We analyze the corresponding linearized mean field equations with respect to estimating the phase transition temperature. A brief comparison with the linear programming approach is also given. Testbeds consisting of generated problems within the Swedish high school system are solved efficiently, with high-quality solutions as results.

1 Introduction
The neural network (NN) approach has shown great promise for producing good solutions to difficult optimization problems (Peterson 1990). The NN approach in this problem domain is very attractive from many perspectives. The mappings onto NN are often extremely straightforward to carry out, implying very modest software development. Parallel implementations are natural. Furthermore, by using the mean field theory (MFT) approximation, a set of deterministic equations replaces the CPU-demanding stochastic updating procedures often used to avoid getting stuck in local minima. Most activities have focused on "artificial" applications like the graph partition and traveling salesman problems. Recently more nested and realistic problems like scheduling were approached with encouraging results (Gislén, Peterson, and Söderberg 1989, hereafter referred to as GPS 1989). The key strategies for obtaining good and parameter-insensitive solutions with the NN approach for optimization problems in general are as follows:
- Map the problem onto multistate (Potts) neurons (Peterson and Söderberg 1989) rather than the binary neurons used in the original work by Tank and Hopfield (Hopfield and Tank 1985).
Neural Computation 4, 805-831 (1992)
© 1992 Massachusetts Institute of Technology
- Establish the approximate position of the phase transition temperature T_c in advance by estimating the eigenvalue spectrum of the equations linearized in the vicinity of the trivial fix-point (Peterson and Söderberg 1989).

In this paper we further develop the approach of GPS (1989) to solve scheduling problems. Whereas the testbeds in that work were of a "synthetic" and simplified nature, we here aim at solving a realistic problem: scheduling a Swedish high school. This requires further development of our approach in a number of directions. The paper is organized as follows. In the remainder of this section we review the results of GPS (1989). A brief description of the structure of the Swedish high school curriculum and scheduling rules is also given (details can be found in Appendix B). Based on this testbed we give a list of requirements that the NN approach needs to fulfill. In Section 2 the NN approach of GPS (1989) is further developed to meet these requirements. Section 3 contains prescriptions for how to deal with the soft constraints occurring in the problem, and in Section 4 we discuss the self-coupling terms needed to improve the dynamics. Issues related to the mean field dynamics are dealt with in Section 5, in particular phase transition properties of interacting Potts neurons with different numbers of components (details of this discussion can be found in Appendix A). A realistic problem is solved by the NN approach in Section 6. In Section 7 we briefly discuss other approaches to this kind of problem, such as linear programming. Finally, in Section 8, a brief summary and outlook can be found.
1.1 Neural Networks and Scheduling: A Synthetic Example. In GPS (1989) a simplified scheduling problem with the appropriate basic structure, where N_p teachers give N_q classes in N_x class rooms at N_t time slots, was mapped onto a Potts neural network. In this problem one wants solutions where all N_p teachers give a lecture to each of the N_q classes, using the available space-time slots with no conflicts in space (class rooms) or time. These are the hard constraints that have to be satisfied. In addition, soft constraints like continuity in classrooms were also considered. The basic entities of this problem can be represented by four sets consisting of N_p, N_q, N_x, and N_t elements, respectively. There is a very transparent way to describe this problem that naturally lends itself to the Potts neural encoding of Peterson and Söderberg (1989), where events, defined by teacher-class pairs (p, q), are mapped onto space-time slots (x, t). Multistate (or Potts) neurons S_{pq;xt} are defined to be 1 if the event (p, q) takes place in the space-time slot (x, t) and 0 otherwise. The hard constraints in this picture are as follows:

1. An event (p, q) should occupy precisely one space-time slot (x, t).
2. Different events (p1, q1) and (p2, q2) should not occupy the same space-time slot (x, t).
3. A teacher p should have at most one class at a time.
4. A class q should have at most one teacher at a time.

A schedule fulfilling all the hard constraints is said to be legal. The first constraint can be embedded in the neural network in terms of the Potts normalization condition

sum_{x,t} S_{pq;xt} = 1    (1.1)
for each event (p, q). In other words we have N_p N_q neurons, each of which has N_x N_t possible states. The other three constraints are implemented using energy penalty terms as follows:

E_xt = (1/2) sum_{x,t} sum_{(p,q) != (p',q')} S_{pq;xt} S_{p'q';xt}    (1.2)
E_pt = (1/2) sum_{p,t} sum_{q != q'} sum_{x,x'} S_{pq;xt} S_{pq';x't}    (1.3)

E_qt = (1/2) sum_{q,t} sum_{p != p'} sum_{x,x'} S_{pq;xt} S_{p'q;x't}    (1.4)
This way of implementing the constraints is by no means unique. In particular, one could add a family of terms with no impact on the energy value that merely have the effect of adding a fixed constant to the energy. These extra terms turn out to have a strong impact on the mean field dynamics (see below). In the next step, mean field variables V_{pq;xt} = <S_{pq;xt}>_T are introduced. The corresponding mean field equations at the temperature T are given in terms of the local fields u_{pq;xt},
u_{pq;xt} = -dE/dV_{pq;xt}    (1.5)
Equation 1.6 is then iterated with annealing, i.e., starting at a high temperature, and successively lowering it in the course of the process. The critical temperature Tc, which sets the scale of T , is estimated by expanding equation 1.6 around the trivial fix-point (1.7)
808
L. Gislkn, C . Peterson, and 8.Soderberg
Figure 1: Schematic view of a neuron with a self-coupling corresponding to a diagonal term in the energy.

At sufficiently low temperatures the dynamics effectively becomes discrete, turning into a modified version of local optimization. In this regime it turns out to be advantageous to use autobiased local optimization by minimizing a modified energy E + E_D,
where E_D adds a diagonal part to the energy expression. It amounts to a bias with respect to the present neuron state. These autobias terms in the energy correspond to the coupling of neurons to themselves (cf. Fig. 1), such that the continuity in (computer) time of the neurons is either rewarded or penalized, depending on the values of the connection strengths.[1] One effect of the autobias terms is that they affect the possibility of escaping from local minima. If these terms are small enough compared to the energy quantization scale (we will refer to this case as low autobias), we obtain a low temperature limiting behavior similar to the case without autobias. The behavior at a nonzero temperature will be different, however. Within the low-autobias region, one can choose the parameters such that unwanted local minima become flattened out, while at the same time keeping the difference in critical temperatures between the different modes reasonably low.

[1] Autobias is not to be confused with what is normally called bias, which corresponds to a connection to a permanently fixed, external neuron.
The performance of this algorithm was investigated in GPS (1989) for a variety of problem sizes and for different levels of difficulty, measured by the ratio between the number of events and the number of available space-time slots. It was found that the algorithm consistently found legal solutions with very modest convergence times for problem sizes (N_p, N_q) = (5,5), ..., (12,12). By convergence time we here mean the total number of sweeps needed to obtain a legal solution, no matter how many trials it takes. Also when introducing soft constraints, like the preference for having subsequent lessons in the same room, very good solutions were obtained.

1.2 The Swedish High School System. We have chosen to use schedules inspired by the Swedish high school system as testbeds for two reasons. One reason is that we have easy access to data; the other, more important one is that this is a more complicated and nested problem than the corresponding U.S. system, and hence constitutes more of a challenge to the algorithm. The Swedish high school system we use is described in some detail in Appendix B. Let us here just sketch the main structure in order to see which extensions of the formalism of GPS (1989) are needed. To illuminate this structure it is instructive to compare with the more widely known U.S. system. In the U.S. high school system the students are free to pick subjects for each semester, subject to the constraint that the curriculum is going to be fulfilled in the end. The number of subjects chosen has to coincide with the number of hours per day. The schedule looks the same for students and teachers each day; one has a day periodicity. It also implies that "classes" are never formed as in an elementary school, where a set of equal-grade students continuously have lessons together. The Swedish high school system is very different. Basically the curriculum is implemented in the same way as in an elementary school, in the sense that "classes" are formed that remain stable for all 3 years. Moreover, the schedules look different from day to day; one has a week periodicity. Most subjects are compulsory, but not all. For optional subjects (in particular foreign languages) the proper classes are divided into option groups that subsequently recombine to form temporary classes. To get a feeling for the complexity of the problem we refer the reader to Tables 2-5.

1.3 Needed Extensions of Formalism. The synthetic scheduling problem of GPS (1989) contains several simplifications as compared to realistic problems, in particular when it comes to the Swedish high school system. Here we give a list of items that an extended formalism will need to handle.
1. One week periodicity (occasionally extended to 2- or 4-week periodicity).
2. In GPS (1989) each teacher has a class exactly once. In our case the teacher has to give lessons in certain subjects a few hours a week, appropriately spread out.

3. In GPS (1989) it was assumed that all classrooms were available for all subjects. This is not the case in reality. Many subjects require special purpose rooms.
4. Many subjects are taught for 2 hours in a row (double hours).
5. Group formation. For some optional subjects the classes are broken up into option groups temporarily forming new classes.

6. Certain preferences have to be taken into account, to meet, e.g., special desires from teachers.

2 Neural Encoding

2.1 Factorization into x- and t-Neurons. Prior to extending the formalism to cover points 1-6, there is an important simplification that can be made to the formalism used for the synthetic problem of GPS (1989). It turns out that with the encoding S_{pq;xt} the MFT equations give rise to two phase transitions, one in x and one in t. In other words, the system naturally factorizes into two parts. It is therefore economical to implement this factorization already at the encoding level. This is done by replacing S_{pq;xt} by x-neurons S^(x)_{pq;x} and t-neurons S^(t)_{pq;t},
with separate Potts conditions replacing equation 1.1,
$$\sum_x S^{(x)}_{p,q;x} = 1 \tag{2.2}$$
and
$$\sum_t S^{(t)}_{p,q;t} = 1 \tag{2.3}$$
respectively. This means that the number of degrees of freedom reduces from $N_p N_q N_x N_t$ to $N_p N_q (N_x + N_t)$. Redoing the problem of GPS (1989) with this factorized formalism, we find that the quality of the solutions does not deteriorate with this more economical way of encoding the problem. Needless to say, the sequential execution time goes down. In what follows this factorized encoding will be used.
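To get a feeling for the size of this reduction, consider an illustrative problem (the numbers here are hypothetical, chosen only for the arithmetic, and are not those of the test problem treated later):
$$N_p N_q N_x N_t = 4000 \cdot 60 \cdot 50 = 1.2 \times 10^7 \qquad \text{versus} \qquad N_p N_q (N_x + N_t) = 4000 \cdot (60 + 50) = 4.4 \times 10^5$$
a reduction of the number of neuronic degrees of freedom by a factor of about 27.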
2.2 Extending the Formalism. When taking into account points 1-4 above in our formalism, one cannot keep $p$ and $q$ as the independent quantities. Rather we define an independent variable $i$ (event index) to which $p$ and $q$ are attributes, $p(i)$ and $q(i)$. The values of these exist in a look-up table. This table contains all the necessary information to process each event $i$. The time index $t$ we subdivide into weekdays ($d$) and daily hours ($h$). The Potts conditions (equations 2.2 and 2.3) now read
$$\sum_x S^{(x)}_{i;x} = 1 \tag{2.4}$$
and
$$\sum_{d,h} S^{(t)}_{i;d,h} = 1 \tag{2.5}$$
The interpretation of $S^{(x)}_{i;x}$ ($S^{(t)}_{i;t}$) is of course the same as before: it is 1 if event $i$ takes place in room $x$ (at time $t$) and 0 otherwise. The Potts neurons will have a different number of components. We describe this by the matrices $C^{(t)}_{i;t}$ and $C^{(x)}_{i;x}$ defined as
$$C^{(t)}_{i;t} = \begin{cases} 1 & \text{if } t \text{ is allowed for t-neuron } i \\ 0 & \text{if } t \text{ is not allowed} \end{cases}$$
and
$$C^{(x)}_{i;x} = \begin{cases} 1 & \text{if } x \text{ is allowed for x-neuron } i \\ 0 & \text{if } x \text{ is not allowed} \end{cases}$$
To facilitate the handling of double hours we introduce effective t-neurons $\tilde{S}^{(t)}_{i;d,h}$ defined as
$$\tilde{S}^{(t)}_{i;d,h} = \sum_{k=0}^{g_i - 1} S^{(t)}_{i;d,h-k} \tag{2.6}$$
where the multiplicity $g_i$ is 1 for single hours and 2 for double hours. With this notation the collision energies of equations 1.2, 1.3, and 1.4 read$^2$
$$E_{xt} = \sum_{i \neq i'} \sum_x S^{(x)}_{i;x} S^{(x)}_{i';x} \sum_{d,h} \tilde{S}^{(t)}_{i;d,h} \tilde{S}^{(t)}_{i';d,h} \tag{2.7}$$
$$E_{pt} = \sum_{i \neq i'} \delta_{p(i),p(i')} \sum_{d,h} \tilde{S}^{(t)}_{i;d,h} \tilde{S}^{(t)}_{i';d,h} \tag{2.8}$$
$$E_{qt} = \sum_{i \neq i'} \delta_{q(i),q(i')} \sum_{d,h} \tilde{S}^{(t)}_{i;d,h} \tilde{S}^{(t)}_{i';d,h} \tag{2.9}$$

$^2$Throughout this paper, notation like $i \neq i'$ means either a double sum or a single sum over $i$ with fixed $i'$, depending on context.
Figure 2: Formation of option groups and recombination into primordial classes.

2.3 Periodicity. Most subjects have week periodicity. However, a few subjects with few hours per semester might occur with 2- or 4-week periodicity. This feature is easily implemented in our formalism.

2.4 Group Formation. So far the classes have been the fundamental objects. To account for point 5 above we need a formalism that allows for the breaking up of these primordial classes and subsequent recombination into quasiclasses. This is a very common practice with, e.g., foreign languages (see Appendix B), where the students have many choices. To be economical with resources, option groups with a particular choice from different primordial classes should form a temporary quasiclass (see Fig. 2). This complication can be accounted for by a minor modification of the $E_{qt}$ collision term (equation 2.9) concerning the overlap between classes $q(i)$ and $q(i')$. We extend the possible class values $q(i)$ to include also quasiclasses. The Kronecker $\delta$ in equation 2.9 ensures that only events $i$ and $i'$ with identical primordial classes are summed over. In the case of group formation into quasiclasses one might have contributing pairs of events where the primordial classes are different. Hence one should replace $\delta$ with a more structured overlap matrix $\Gamma$, which is given by a look-up table.
2.5 Relative Clamping. In neural network vocabulary the word clamping normally implies that the equations of motion are settled with a few units not subject to the dynamics; they are clamped. Such phenomena will occur in our applications, in particular when it comes to revisions of existing schedules. What we are concerned with here is another kind of clamping, where one wants a cluster of events to stick together in time: relative clamping. In our formalism this amounts
to making the notational replacement
$$i \to (j,k) \tag{2.11}$$
where $j$ denotes the cluster and $k$ an event within the cluster. In this case one has a common t-neuron for the cluster ($S^{(t)}_{j}$) but distinct x-neurons ($S^{(x)}_{j,k}$) for the individual events.

2.6 Null Attributes. Some activities lack teachers (e.g., lunch) and others involve no classes (e.g., teachers' conferences). It is convenient to include those events in our general formalism by letting $p(i) = 0$ or $q(i) = 0$. Such events of course do not contribute to $E_{pt}$ and $E_{qt}$, respectively.

2.7 Final Expressions of Collision Penalty Terms. We are now ready to write down the generalization of the collision terms of equations 2.7, 2.8, and 2.9 that ensure that all the hard constraints are satisfied, so that the solutions are legal. In the next section we will then include penalty terms corresponding to the soft constraints.
$$E_{xt} = \sum_{(j,k) \neq (j',k')} \sum_x S^{(x)}_{j,k;x} S^{(x)}_{j',k';x} \sum_{d,h} \tilde{S}^{(t)}_{j;d,h} \tilde{S}^{(t)}_{j';d,h} \tag{2.12}$$
$$E_{pt} = \sum_{(j,k) \neq (j',k')} \theta_{j,k}\, \theta_{j',k'}\, \delta_{p(j,k),p(j',k')} \sum_{d,h} \tilde{S}^{(t)}_{j;d,h} \tilde{S}^{(t)}_{j';d,h} \tag{2.13}$$
$$E_{qt} = \sum_{(j,k) \neq (j',k')} \Gamma_{q(j,k),q(j',k')} \sum_{d,h} \tilde{S}^{(t)}_{j;d,h} \tilde{S}^{(t)}_{j';d,h} \tag{2.14}$$
To restrict the sum over $p$ in equation 2.13 to $p(j,k) \neq 0$ events we have introduced $\theta_{j,k}$ according to
$$\theta_{j,k} = \begin{cases} 1 & \text{if } p(j,k) \neq 0 \\ 0 & \text{if } p(j,k) = 0 \end{cases}$$
3 Soft Constraints
There are basically four kinds of soft constraints we encounter when scheduling Swedish high schools.

1. The different lessons for a class in a particular subject should be spread over the week such that they do not appear on the same day.
2. The lunch "lesson" has to appear within 3 hours around noon.

3. The schedules should have as few "holes" as possible; lessons should be glued together.
4. Teachers could have various individual preferences.

The second point is easily taken into account in the Potts condition for the relevant $p(j,k) = 0$ or $q(j,k) = 0$ events; hence it is not formally a soft constraint. The individual preferences (point 4) will be omitted in our treatment due to lack of data. Here we will focus on spreading and gluing.

3.1 Spreading. First we assume that we have access to a subject attribute $s(j,k)$. We then introduce a penalty term $E_{qsd}$ that spreads the lessons for a class in a particular subject over the different weekdays:
$$E_{qsd} = \sum_{j \neq j'} \sum_{k,k'} \delta_{q(j,k),q(j',k')}\, \delta_{s(j,k),s(j',k')} \sum_d \tilde{S}^{(d)}_{j;d} \tilde{S}^{(d)}_{j';d} \tag{3.1}$$
In equation 3.1 we have introduced an effective "day-neuron" according to
$$\tilde{S}^{(d)}_{j;d} = \sum_h S^{(t)}_{j;d,h} \tag{3.2}$$
3.2 Gluing. To avoid "holes" in time for a class we need a penalty term that rewards situations where $(d,h)$-events are glued to $(d,h-1)$-events. The following reward energy serves that purpose:
$$E_{qdh} = -\kappa \sum_q \sum_{d,h} \left[ \sum_{j,k} \delta_{q(j,k),q}\, S^{(t)}_{j;d,h} \right] \left[ \sum_{j',k'} \delta_{q(j',k'),q}\, S^{(t)}_{j';d,h-1} \right] \tag{3.3}$$
In equation 3.3 the parameter $\kappa$ governs the strength of this reward relative to the collision energies (equations 2.12, 2.13, and 2.14).

4 Diagonal Terms
It is clear from equations 2.4 and 2.5, and from the fact that $S^{(x)}_{i;x}$ and $S^{(t)}_{i;t}$ are either 0 or 1, that any partial sum $X$ of neuronic components must also be 0 or 1, such that
$$X^2 = X \tag{4.1}$$
We can use the particular combinations $\sum_x [S^{(x)}_{i;x}]^2 = 1$ and $\sum_t [S^{(t)}_{i;t}]^2 = 1$ to add trivial-valued auxiliary terms to the energy:
$$E_{\rm self} = -\beta^{(x)} \sum_i \sum_x \left[ S^{(x)}_{i;x} \right]^2 - \beta^{(t)} \sum_i \sum_t \left[ S^{(t)}_{i;t} \right]^2 \tag{4.2}$$
These are the only nontrivial terms of this kind that respect the obvious permutation symmetries of the problem. The effect of these extra terms on the energy is merely that of adding a fixed constant, but they will turn out to be important for the mean field dynamics. These diagonal terms correspond to self-coupling interactions (see Fig. 1). Finding legal solutions with no collisions corresponds to minimizing the hard energy
$$E_{\rm hard} = E_{xt} + E_{pt} + E_{qt} \tag{4.3}$$
Legal solutions are given by $E_{\rm hard} = 0$. The soft energy is given by
$$E_{\rm soft} = E_{qsd} + E_{qdh} \tag{4.4}$$
The total energy $E$ to be minimized is then a sum of the hard and soft pieces:
$$E = E_{\rm hard} + E_{\rm soft} \tag{4.5}$$
5 Mean Field Dynamics
Now we introduce the continuous mean field variables $V^{(x)}_{j,k;x}$ and $V^{(t)}_{j;t}$, corresponding to $S^{(x)}_{j,k;x}$ and $S^{(t)}_{j;t}$, respectively. These have the interpretation of being probabilities that the corresponding events occur at given $x$- and $t$-values, with normalizations given by equations 2.4 and 2.5 with $S$ replaced by $V$. Substituting $V$ for $S$ in the energy expressions, the mean field equations at a finite temperature $T$ are given in terms of the local fields $U^{(x)}_{j,k;x}$ and $U^{(t)}_{j;t}$,
$$U^{(x)}_{j,k;x} = -\frac{1}{T} \frac{\partial E}{\partial V^{(x)}_{j,k;x}} \tag{5.1}$$
$$U^{(t)}_{j;t} = -\frac{1}{T} \frac{\partial E}{\partial V^{(t)}_{j;t}} \tag{5.2}$$
as
$$V^{(x)}_{j,k;x} = \frac{e^{U^{(x)}_{j,k;x}}}{\sum_{x'} e^{U^{(x)}_{j,k;x'}}} \tag{5.3}$$
$$V^{(t)}_{j;t} = \frac{e^{U^{(t)}_{j;t}}}{\sum_{t'} e^{U^{(t)}_{j;t'}}} \tag{5.4}$$
where it is understood that only allowed states are considered. The natural mean field dynamics for this system consists of iterating equations 5.1-5.4. There are two main options for how to do this:

• Synchronous updating: A sweep consists of first computing all the local fields, and then updating all the neurons.

• Serial updating: Here, for each neuron, the local field is computed immediately before the corresponding neuron state is updated. Thus, in this case, the sweep order might matter slightly for the outcome of the dynamics. We have consistently used ordered sweeps. (A sketch of one such sweep follows this list.)
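As an illustration of how such an ordered serial sweep can be organized, here is a minimal sketch in Python; the callable `local_field`, its interface, and the array layout are assumptions of this sketch, not part of the paper's F77 implementation:

```python
import numpy as np

def serial_mft_sweep(V, local_field, T):
    """One ordered serial sweep of mean-field Potts updates (equations 5.1-5.4).

    V           : (N, K) array of mean-field variables; each row obeys the
                  Potts condition sum_a V[i, a] = 1.
    local_field : hypothetical callable(V, i) -> length-K array of -dE/dV[i].
    T           : temperature.
    """
    for i in range(V.shape[0]):
        u = local_field(V, i) / T   # fresh values of already-updated neurons
        u -= u.max()                # stabilize the exponentials numerically
        e = np.exp(u)
        V[i] = e / e.sum()          # normalized update keeps sum_a V[i, a] = 1
    return V
```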
For both methods, in addition, the performance depends on the values of the $\beta$ parameters and $T$. The next problem is to understand their effect, and to give them suitable values.
5.1 Choice of Parameters. In principle one could run the algorithm at a fixed temperature, if a suitable value for the latter is known in advance. Empirically, however, a more efficient way of obtaining good solutions turns out to be annealing, i.e., starting at a high temperature and successively lowering it. One can get a good understanding of the parameter dependence by considering the two limits:

• High $T$. At a sufficiently high temperature the system is attracted to a trivial fix-point. The behavior of the system in this phase can be understood in terms of linearized dynamics in the vicinity of the fix-point.

• Low $T$. When the temperature tends to zero, the system enters another well-understood phase: the dynamics effectively becomes discrete, turning into a modified version of local optimization.
At intermediate temperatures, the dynamics is more complex; it is here that the actual computing is done. Loosely speaking, the smooth energy landscape appearing at high $T$ gradually gains structure, and the neurons are gradually forced into a decision. When $T \to 0$ the original discrete landscape is recovered, and the neurons get stuck in a firm decision, representing a local minimum of the energy.
Prior to investigating the two extreme $T$-regions we want to define order parameters that monitor the transitions from one phase to the other. As in GPS (1989) we find it convenient to introduce t- and x-saturations for this purpose,
$$\Sigma_t = \frac{1}{N_t} \sum_{j,t} \left[ V^{(t)}_{j;t} \right]^2 \tag{5.5}$$
and
$$\Sigma_x = \frac{1}{N_x} \sum_{j,k,x} \left[ V^{(x)}_{j,k;x} \right]^2 \tag{5.6}$$
where $N_x$ and $N_t$ here denote the total number of x- and t-neurons, respectively (for the t-neurons of course only the clusters are counted). The $T \to 0$ limit is characterized by $\Sigma_x, \Sigma_t \to 1$.

5.2 The High Temperature Phase. As stated above, at high temperatures the system has an attractive trivial fix-point, and the high-$T$ dynamics can be understood by linearizing it in the vicinity of this point. In GPS (1989) this fix-point was the totally symmetric one of equation 1.7. In this case the situation is somewhat more complicated, due to the lesser degree of symmetry. At high temperatures, however, the trivial fix-point is well approximated by the symmetric point (corrections are of order $1/T$). For the linear fluctuations $v^{(x)}$ and $v^{(t)}$ around this point, the dynamics (equations 5.3 and 5.4) is then replaced by the linearized equations (see Appendix A)
$$v^{(t)}_{j} = \frac{1}{K_j}\, Q_j\, u^{(t)}_{j} \tag{5.7}$$
$$v^{(x)}_{j,k} = \frac{1}{K_{j,k}}\, Q_{j,k}\, u^{(x)}_{j,k} \tag{5.8}$$
Here, the $Q$-matrices are mere projections onto the locally allowed states, $K_j$ ($K_{j,k}$) is the number of allowed states for t-neuron $j$ (x-neuron $j,k$), and $u^{(t)}$ and $u^{(x)}$ are the linear fluctuations in the corresponding local fields,
$$u^{(t)} = \frac{1}{T}\left( A\, v^{(t)} + B\, v^{(x)} \right) \tag{5.9}$$
$$u^{(x)} = \frac{1}{T}\left( B^{\rm T} v^{(t)} + C\, v^{(x)} \right) \tag{5.10}$$
where $A$, $B$, and $C$ are matrices resulting from linearizing equations 5.1 and 5.2. Due to the $1/T$ in equations 5.1 and 5.2, $v^{(t)}$ and $v^{(x)}$ will tend to zero under iteration of these equations for high enough $T$, and the fix-point is stable. This defines the high-$T$ phase. At a certain temperature
$T_c$, the stability will be lost, due to an eigenvalue crossing the unit circle. Empirically, for moderate parameter values the corresponding mode is dominantly a t-mode, which is not surprising, since the x-neurons appear in only one of the energy terms, whereas the t-neurons appear in all of them. Thus, to estimate $T_c$ we can disregard the x-modes, and we have the following equations for the t-modes:
$$v^{(t)}_{j} = \frac{1}{K_j}\, Q_j\, u^{(t)}_{j} \tag{5.11}$$
$$u^{(t)} = \frac{1}{T}\, A\, v^{(t)} \tag{5.12}$$
or, with matrix notation in Potts space,
$$v^{(t)} = \frac{1}{T}\, \hat{M}\, v^{(t)} \tag{5.13}$$
For synchronous updating, an update sweep consists of iterating this matrix equation, and since the matrix is proportional to $1/T$, the computation of $T_c$ is reduced to the computation of the eigenvalues of a fixed matrix. The case of sequential updating is a little more tricky. A generic discussion of the high-$T$ phase transition for both types of updating can be found in Appendix A. The point we want to make here is that, for both methods of updating, the critical temperature can be estimated in advance. It will of course depend on the self-coupling $\beta^{(t)}$ (and in principle also on $\beta^{(x)}$). This dependence will typically be as follows (cf. also Fig. 3):
• For large negative $\beta$, $T_c(\beta)$ has a negative slope ($= -1/K_{\rm eff}$, where $K_{\rm eff}$ is an effective Potts dimension), and the relevant eigenvalue at $T_c$ is $-1$, which is disastrous for the stability of the dynamics, since it gives rise to oscillating behavior.

• For sequential updating, there follows a region, still with negative slope, with complex critical eigenvalues (in the synchronous case this is ruled out, since the relevant matrix can be made Hermitian).

• Finally, for large enough $\beta$ the slope is positive ($= 1/K_{\rm eff}$) and the relevant eigenvalue is $+1$. This is the desired case, but it is difficult to obtain with synchronous updating, and we have therefore consistently used sequential updating, where it can be obtained even with a small negative self-coupling.
5.3 The Low Temperature Limit. When $T \to 0$, the dynamics effectively turns discrete (winner-takes-all), and in the absence of self-coupling it is equivalent to simple local optimization of the discrete energy of equation 4.5. With a nonzero self-coupling we obtain, as mentioned in the introduction, autobiased local optimization: staying in the present
state is rewarded or penalized, depending on the sign of the self-coupling parameter $\beta$. If the self-coupling is positive, there is an increased tendency to stay in the present state, leading to increased stability. With a negative self-coupling, the effect is the opposite: the stability of the present state is decreased. In both cases, if the self-coupling is smaller than the quantization scale of the energy, it has no effect in practice at small $T$.

Figure 3: The dependence of $T_c$ on the self-coupling $\beta$, for synchronous (SYN) and sequential (SEQ) updating. The type of eigenvalue is indicated by ($\pm 1$) or (U) (for complex unitary).
6 Solving a Realistic Problem

The entire process of scheduling high school activities in a given geographical area is very complex. With existing manual or semimanual procedures it is always factorized down to a few independent steps.

• Concentrate certain majors and special subjects (e.g., languages) to certain schools. Assign the students to these schools based on their choices of majors and options (see Appendix B). Form the classes and option groups at the different schools accordingly.
• Given their competence, assign teachers to the different classes and groups.

• Given the conditions generated by the two steps above, solve the scheduling problem.
It is the last step that is the subject of this work. We have used data from the Tensta high school (see Appendix B) as a test problem for our algorithm. Since a complete set of consistent data has not been available to us, we have chosen to generate legal schedules based on available class and teacher data. For each time-slot in a week, approximately $N_q$ events are generated in the following way:$^3$ For each event a teacher, a subject, and a classroom category are randomly generated, with the constraint that no collisions occur. Two classroom categories are assumed to exist, small and large, to host entire classes and minor groups, respectively. These two categories are generated with suitable relative probability. Similarly, single and double hour lessons are generated with equal probability. We have generated 10 different problems in this way. One might object that this is not an entirely real problem and hence not relevant for testing the algorithm. We think the situation is the other way around: our generated problems are very likely far more difficult, since there are no preset ties between teachers, classes, and classrooms that make the problem more structured in terms of fewer "effective" degrees of freedom. Our algorithm is implemented in the following "black box" manner in the case of sequential updating:
• Choose a problem.

• Set $\kappa = 0.2$, $\beta^{(x)} = -0.1$, and $\beta^{(t)} = -0.1$, respectively. The results are fairly insensitive to these choices.

• Determine $T_c$ by iterating the linearized dynamics (equation 5.7).

• Initialize with $V^{(x)}_{j,k;x} = C^{(x)}_{j,k;x}/K^{(x)}_{j,k}\,(1 + 0.001 \times {\rm rand}[-1,1])$ and $V^{(t)}_{j;t} = C^{(t)}_{j;t}/K^{(t)}_{j}\,(1 + 0.001 \times {\rm rand}[-1,1])$, respectively.

• Anneal with $T_{n+1} = 0.9 \times T_n$ until $\Sigma = 0.9$.

• At each $T_n$, perform one update per neuronic component (= one sweep) with sequential updating, using equations 5.1-5.4.

• After $\Sigma = 0.9$ is reached, check whether the obtained solution is legal, i.e., $E_{\rm hard} = 0$ (equation 4.3). If this is not the case, the network is initialized with a different seed and is allowed to resettle. This procedure is repeated until a legal solution is found. (A sketch of this annealing driver is given below.)
"When data.
consistent lists will become available to us we of course intend to use real
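The "black box" procedure above can be summarized as the following sketch; the callables `sweep`, `saturation`, and `hard_energy`, and their interfaces, are assumptions of this sketch rather than the authors' code:

```python
import numpy as np

def anneal(n_neurons, K, sweep, saturation, hard_energy, T_c,
           cooling=0.9, target=0.9, seed=0):
    """Anneal repeatedly until a legal solution (E_hard = 0) is found."""
    rng = np.random.default_rng(seed)
    while True:
        # initialize near the symmetric point, with 0.1% noise
        V = (1.0 + 0.001 * rng.uniform(-1.0, 1.0, (n_neurons, K))) / K
        V /= V.sum(axis=1, keepdims=True)
        T = T_c
        while saturation(V) < target:   # anneal until Sigma = 0.9
            V = sweep(V, T)             # one sweep per temperature
            T *= cooling                # T_{n+1} = 0.9 * T_n
        if hard_energy(V) == 0:         # legal schedule: hard constraints met
            return V
        # otherwise: reseed (fresh random initialization) and resettle
```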
Table 1: Four-Week Schedule for Class E2b, showing the lessons in time slots 1-10, Monday through Friday, for each of the 4 weeks.$^a$

$^a$Different letters stand for different subjects; lunch is denoted by an asterisk. Lessons that differ from week to week are given in boldface.
We have used this procedure for 10 generated sets of data as described above. In Figure 4 we show a typical evolution of $E_{\rm hard}$, $\Sigma_t$, $\Sigma_x$, and $T$ as functions of $N_{\rm sweep}$. For the 10 experiments performed, the average $N_{\rm sweep}$ needed to reach a legal solution was $\approx 70$. We did not observe a single case where more than one annealing was needed to obtain an $E_{\rm hard} = 0$ solution. Although there is no objective measure of the quality of a solution, in the opinion of individuals acquainted with the scheduling process, all the network solutions are comparable to or better than solutions that would be obtained with the semimanual or manual procedures currently in use. A typical example is shown in Table 1, where we note how well the algorithm has glued the lessons together: very few holes. Another prominent feature is how well the algorithm handles the fact that events occur with different periodicity. The algorithm was implemented in F77 on an APOLLO DN 10000 workstation. It takes approximately 1 hour to perform the required 70 sweeps.

7 Comparison with Other Approaches
What other algorithms have been pursued for the kind of realistic problems dealt with in this paper? An extensive search of the operations research literature has not yielded any result with respect to algorithms that deal with scheduling problems of our size and complexity. This is not to say that there exist no computerized tools for scheduling in the educational sector. However, all those programs known to the authors represent refinements of the manual method, with bookkeeping capabilities for the manual decisions made. Since these are not problem solvers in the real sense, we have not found it meaningful to compare the results from our algorithm with procedures using such packages. For subproblems with less complexity there exists in the
literature an automated approach for classroom allocation using linear programming (Gosselin and Truchon 1986). This is a simpler problem, since it does not contain the time-tabling part. We have not had access to the test bed of this simpler problem. It is nevertheless interesting to compare the algorithmic complexity of the neural approach versus the linear programming one for this problem.

Figure 4: Energy ($E_{\rm hard}$), saturations ($\Sigma_t$, $\Sigma_x$), and temperature ($T$) as functions of $N_{\rm sweep}$ for one run.
Denote the total number of rooms and time slots by $N_x$ and $N_T$, respectively. In Gosselin and Truchon (1986) the rooms are categorized, so another important quantity is the number of room categories $N_{\hat{x}}$. The computational load $\nu$ with linear programming should then scale as
$$\nu \propto N_{\hat{x}}\, N_x^2\, N_T^2 \tag{7.1}$$
One can also convince oneself that with the Potts encoding and MFT techniques used in this paper the corresponding scaling relation is (with $M_x$ the average number of room alternatives per event)
$$\nu \propto M_x\, N_x\, N_T \tag{7.2}$$
where we have assumed that the number of iterations needed for the MFT equations is constant with problem size. This empirical fact seems to hold whenever the location of $T_c$ is estimated in advance using the linearized MFT equations, as is done in this paper and in Peterson (1990). As can be seen from equations 7.1 and 7.2, the MFT neural approach scales better in $N_x$ and $N_T$ for this problem. The same relative advantage for the neural approach of course holds for the problem dealt with in this paper; that is the reason the neural approach is able to solve a problem of this complexity.

8 Summary and Outlook
In this paper we have applied a neural network technique to find good solutions to a difficult scheduling problem, with very good performance results with respect to both quality and CPU time consumption. Even with sequential updating of the neurons, the algorithm can be efficiently executed on a SIMD computer like a CRAY by parallelizing the calculation of the local fields (equations 5.1 and 5.2). Speedup factors proportional to $N_x$ ($N_t$) can then be obtained, which in our case corresponds to 50-60. This amounts to an execution time of approximately 1 minute on a CRAY X-MP. Indeed, such gains were realized when executing the problem of GPS (1989) on a CRAY X-MP. As compared with previous work (GPS 1989) we have developed the formalism in the following directions:

1. Factorization into x- and t-neurons (equation 2.1).

2. Analysis of the dynamics in the critical $T$-region, also in the case where the neurons have different Potts dimensions.

3. Extension of the formalism to deal with group formation, double hours, week periodicity (sometimes broken into 4-week periodicity), relative clamping, and spreading. All of these features are necessary for solving Swedish high school problems.
Also, the realistic problems we are considering constitute a substantially larger problem size than those dealt with in GPS (1989). With approximately 90 teachers, 50 weekly hours, 45 classes, and 60 classrooms, the problems correspond to an astronomically large number of possible choices. In our factorized Potts formulation this corresponds to roughly $10^5$ neuronic degrees of freedom. Empirically we have found that the number of sweeps needed to produce legal solutions increases only very slowly with the problem size $N$, implying that the CPU time consumption scales approximately like $N$. A revision capability is inherent in our formalism, which is handy in situations when encountering unplanned events once a schedule exists. Such rescheduling is performed by clamping those events not subject to change and heating up and cooling the remaining dynamic neurons. One should keep in mind that problems of this kind and size are so complex that even several man-months of human planning will in general not yield solutions that meet all the requirements in an optimal way. We have not been able to find any algorithm in the literature that solves a realistic problem with this complexity. Existing commercial software packages do not solve the entire problem. Rather, the problem is solved interactively, with a user taking stepwise decisions. Linear programming methods have been applied only to the simpler problem of classroom allocation. These methods scale with problem size and complexity in such a way that it is very hard to deal with the kind of problem treated in this work. In cases where the dimensions of the optimization problem can, due to its geometric nature, be reduced (e.g., the traveling salesman problem), it is advantageous to abandon the Potts probabilistic description for a template formulation [elastic net (Durbin and Willshaw 1987)], in order to reduce the number of degrees of freedom. No such reduction of the number of degrees of freedom is possible in scheduling problems like the one studied here. The scheduling of a high school is of course only one example of a difficult planning problem; there are many similar problems occurring in industry and in the public sector. We have chosen this particular application since we feel that the problem is representative enough of this class of problems and also because real data were available to us. In one respect this problem is simple: it has no topological structure as in, e.g., airline crew scheduling. We are presently extending our formalism to cover such cases.
Appendix A: The High Temperature Fixed Point

For the sake of notational simplicity (and generality), the arguments in this section are carried through with neurons $S_{ia}$ subject to the Potts condition
$$\sum_a S_{ia} = 1 \tag{A1}$$
Consider an energy with a general form of the interactions $T^{ab}_{ij}$,
$$E = -\frac{1}{2} \sum_{i,j} \sum_{a,b} T^{ab}_{ij}\, S_{ia}\, S_{jb} \tag{A2}$$
Replacing $S_{ia}$ in equation A2 by the corresponding MFT variables $V_{ia}$, the local fields $U_{ia} \equiv -\frac{1}{T}\, \partial E / \partial V_{ia}$ are given by
$$U_{ia} = \frac{1}{T} \left( \sum_{j,b} T^{ab}_{ij}\, V_{jb} + \beta\, V_{ia} \right) \tag{A3}$$
where we have inserted a self-coupling $\beta$. Introducing $K_i$ as the number of allowed states for neuron $i$, and the matrix $C_{ia}$ as in Section 2.2, the MFT equations read
$$V_{ia} = \frac{C_{ia}\, e^{U_{ia}}}{\sum_b C_{ib}\, e^{U_{ib}}} \tag{A4}$$
At high enough temperature $T$, the dynamics expressed by equations A3 and A4 will have an attractive trivial fix-point $V^{(0)}_{ia}$, close to the symmetry point
$$V_{ia} = \frac{C_{ia}}{K_i} \tag{A5}$$
where every allowed state is equally probable. At a certain critical temperature $T_c$ the trivial fix-point will become unstable. To find the position of this phase transition, we have to linearize the dynamics in the neighborhood of the fix-point. In terms of the deviation $v_{ia} = V_{ia} - V^{(0)}_{ia}$, the linearized dynamics takes the form
$$v_{ia} = \frac{1}{K_i} \sum_b Q^{i}_{ab}\, u_{ib} \tag{A6}$$
with $u_{ia}$ given by
$$u_{ia} = \frac{1}{T} \left( \sum_{j,b} T^{ab}_{ij}\, v_{jb} + \beta\, v_{ia} \right) \tag{A7}$$
Approximating $V^{(0)}_{ia}$ by the symmetric point of equation A5, $Q_i$ is a mere projection on the locally transverse (i.e., $\sum_a v_{ia} = 0$)
allowed subspace, with dimension $(K_i - 1)$:
$$Q^{i}_{ab} = C_{ia}\, \delta_{ab} - \frac{1}{K_i}\, C_{ia}\, C_{ib} \tag{A9}$$
Thus, if we define $M^{ab}_{ij}$ as the projection of $T^{ab}_{ij}$ on this subspace,
$$M^{ab}_{ij} = \sum_{c,d} Q^{i}_{ac}\, T^{cd}_{ij}\, Q^{j}_{db} \tag{A10}$$
we obtain for the dynamics in the subspace
$$v_{ia} = \frac{1}{T K_i} \sum_{j,b} \left( M^{ab}_{ij} + \beta\, \delta_{ij}\, Q^{i}_{ab} \right) v_{jb} \tag{A11}$$
This can be symmetrized in terms of $\tilde{v}_{ia} = \sqrt{K_i}\, v_{ia}$:
$$\tilde{v}_{ia} = \frac{1}{T} \sum_{j,b} \tilde{M}^{ab}_{ij}\, \tilde{v}_{jb} \tag{A12}$$
with
$$\tilde{M}^{ab}_{ij} = \frac{1}{\sqrt{K_i K_j}} \left( M^{ab}_{ij} + \beta\, \delta_{ij}\, Q^{i}_{ab} \right)$$
Above $T_c$ all the eigenvalues of the corresponding sweep update matrix are less than unity in magnitude, and $T_c$ is obviously the highest $T$ allowing a unitary eigenvalue. It will depend, though, on how the updating is done. For synchronous updating, the sweep update matrix is simply $\tilde{M}/T$, and the only possible unitary eigenvalues are $\pm 1$, since $\tilde{M}$ is real and symmetric. We then obtain, in terms of the extreme eigenvalues $\lambda_{\max}$ and $\lambda_{\min}$ of the matrix $\tilde{M}$,
$$T_c = \max(-\lambda_{\min}, \lambda_{\max}) \tag{A13}$$
and the corresponding dominant eigenvalue is $-1$ for $\beta$ below a certain value, and $+1$ above this value. For ordered sequential updating, things are a little more involved. When computing the new value of a neuron, fresh values are used for the already updated neurons. The sweep update matrix is then somewhat more complicated, but can still be expressed in terms of the diagonal ($i = j$) part $M_D$, the upper ($i < j$) part $M_U$, and the lower ($i > j$) part $M_L$ of $\tilde{M}$ as
$$(T\,\mathbf{1} - M_L)^{-1} (M_D + M_U) \tag{A14}$$
and $T_c$ in this case is obtained by examining when this matrix has an eigenvalue of modulus one. In this case the eigenvalues do not have
to be $\pm 1$, but these values play a special role also here. Assuming an eigenvector $\phi$ with a unitary eigenvalue $e^{2i\theta}$, we have
$$\left( e^{i\theta}\, T - e^{-i\theta}\, M_D \right) \phi = \left( e^{i\theta}\, M_L + e^{-i\theta}\, M_U \right) \phi \tag{A15}$$
where the total matrix on the right side is Hermitian. Note that with eigenvalue $+1$, the result $T_c = \lambda_{\max}$ for synchronous updating is reproduced. For other eigenvalues, we can get some information by taking the scalar product with $\phi^{\dagger}$. Assuming $\phi$ to be normalized, we obtain
$$e^{i\theta}\, T - e^{-i\theta}\, \beta \langle 1/K \rangle = \phi^{\dagger} \left( e^{i\theta}\, M_L + e^{-i\theta}\, M_U \right) \phi \tag{A16}$$
where $\langle 1/K \rangle$ appears because of the special form of the diagonal term in equation A3. Since the right member of equation A16 is obviously real, the imaginary part of the left member must vanish:
$$\sin\theta \left( T + \beta \langle 1/K \rangle \right) = 0 \tag{A17}$$
For eigenvalues $\neq 1$, this implies
$$T = -\beta \langle 1/K \rangle \tag{A18}$$
We conclude that eigenvalue 1 is the only possibility for nonnegative $\beta$. Empirically, the dominant eigenvalue is $-1$ for large enough negative $\beta$, complex unitary for intermediate $\beta$ values, and $+1$ for large enough $\beta$ values (including positive $\beta$). In the special case where all the $K_i$ are the same, the eigenvalue $-1$ appears for sequential updating only in the limit $\beta \to -\infty$. In this case the $\beta$ dependence of $T_c$ is also particularly simple (Peterson and Söderberg 1989): it is piecewise linear for both updating methods.
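To make the advance estimation of $T_c$ concrete, here is a minimal sketch for both updating schemes, given the symmetrized matrix $\tilde{M}$ as a dense array; the bisection tolerance and search bracket are assumptions of this sketch:

```python
import numpy as np

def tc_synchronous(M):
    """T_c = max(-lambda_min, lambda_max) for synchronous updating (eq. A13).
    Assumes M is the real, symmetric sweep matrix."""
    lam = np.linalg.eigvalsh(M)
    return max(-lam.min(), lam.max())

def tc_sequential(M, tol=1e-8):
    """T_c for ordered sequential updating: the largest T at which the sweep
    update matrix (T*1 - M_L)^(-1) (M_D + M_U) of equation A14 has an
    eigenvalue of modulus one; located here by bisection."""
    n = M.shape[0]
    M_L = np.tril(M, -1)     # lower (i > j) part
    M_DU = np.triu(M)        # diagonal plus upper part
    def spectral_radius(T):
        A = np.linalg.solve(T * np.eye(n) - M_L, M_DU)
        return np.abs(np.linalg.eigvals(A)).max()
    lo, hi = tol, 2.0 * tc_synchronous(M) + 1.0   # assumed bracket for T_c
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if spectral_radius(mid) > 1.0:   # below T_c: fix-point unstable
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```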
Appendix B: The Swedish High School System

The Swedish high school system differs strongly from that of the U.S. Prior to entering high school, the students decide upon a certain major. These majors can be either theoretical or more trade oriented. The theoretical majors take 3 years to complete, whereas some of the more practically oriented majors take 2 years. Within each major some subjects are optional, in particular foreign languages. Apart from that, the students follow each other in classes throughout high school. As mentioned in the text, the schedules are essentially periodic over weeks (and not days, as in the U.S. system). All in all there are some 20 different possible majors.
Table 2: Curriculum for Natural Sciences (N), Social Studies (S), and Fine Arts (FA) Majors, listing the weekly hours per subject in years 1-3 for each major. The subjects are Swedish, English, B/C-language, C-language, linguistics, Greek, history, religion, philosophy, psychology, civics, social science, mathematics, natural sciences, physics, chemistry, biology, history of music and arts, arts/music, special subject, physical education, and project.

$^a$In year 2 or 3 a special subject can be chosen. In that case history of music and arts is dropped in year 2. In year 3 the B/C-language or English is dropped, together with history.
$^b$Choice between philosophy and psychology in year 3.
$^c$Two of these subjects are chosen.
$^d$Greek is optional in year 3, in which case civics and two out of three languages are dropped.
However, for a typical high school only 8-10 of these are available. In the particular high school we use as a test-bed, the following majors are present:

N  = Natural sciences                   (3 years)
S  = Social studies                     (3 years)
FA = Fine arts                          (3 years)
E  = Economics                          (3 years)
C  = Commercial                         (3 years)
T  = Technology                         (4 years)
So = Social studies                     (2 years)
HC = Health care                        (2 years)
M  = Merchandise and consumer techn.    (2 years)
The corresponding curricula are given in Tables 2, 3, and 4. A few explanations follow.
Table 3: Curriculum for Economics (E), Commercial (C), and Technology (T) Majors, listing the weekly hours per subject in years 1-3 for E and C and years 1-4 for T. The subjects are Swedish, English, B/C-language, C-language, history, religion, psychology, civics, mathematics, physics, chemistry, biology, technology, miscellaneous technical subjects, industrial practice, natural sciences, business economy, administration, law, marketing, methodology/ergonomics, typing, shorthand, information and text processing, communication, special subject, physical education, and project.

$^a$Two of these subjects should be chosen.
$^b$Two of these subjects should be chosen, of which at least one should be B- or C-language. The distribution of weekly hours between the different years may vary locally for the commercial major.
All numbers refer to weekly hours. Fractional weekly hours like 2.5 can be implemented by having, for example, 2 hours in the fall and 3 hours in the spring. Another option is to have alternating weekly schedules.
Table 4: Curriculum for Social Studies [2-year] (So), Health Care (HC), and Merchandise and Consumer Technology (M) Majors, listing the weekly hours per subject in years 1-2 for each major. The subjects are Swedish, English, B/C-language, history, religion, psychology, ergonomics, civics, social science, mathematics, natural sciences, administration, health care, child and youth care, health care practice, arts/music, typing, goods handling, consumer knowledge, special subject, physical education, and project.

$^a$One of these subjects should be chosen.
In elementary school it is compulsory for students to take Swedish and English plus one additional foreign language (B-language). When entering high school, this B-language can either be pursued further or replaced by a new foreign language (C-language). In some majors both a B/C-language and an additional C-language are chosen. B/C-languages can be French, German, Spanish, Russian, etc., varying from school to school.
The Test-Bed Problem

The realistic problem we have considered was obtained from Tensta High School in terms of the number of classes for the different majors. More specifically, this high school has the classes shown in Table 5, which roughly correspond to 1000 students.
Table 5: Tensta High School Classes in 1990. In cases with more than one class per major and year, the notation is a, b, ....

Major   year 1          year 2          year 3
N       N1a, N1b        N2a, N2b        N3a, N3b
S       S1              S2              S3
H       H1              H2              H3
E       E1a, E1b        E2a, E2b        E3a, E3b
C       C1a, C1b        C2a, C2b        C3
T       T1              T2
So      So1a, So1b      So2a, So2b
HC      HC1a, HC1b      HC2
M       M
Acknowledgments

We would like to thank L. Jornstad for providing us with data from the Tensta High School.
References

Durbin, R., and Willshaw, G. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689.
Gislén, L., Peterson, C., and Söderberg, B. 1989. Teachers and classes with neural networks. Int. J. Neural Syst. 1, 167.
Gosselin, K., and Truchon, M. 1986. Allocation of classrooms by linear programming. J. Oper. Res. Soc. 37, 561.
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybernet. 52, 141.
Peterson, C. 1990. Parallel distributed approaches to combinatorial optimization. Neural Comp. 2, 261.
Peterson, C., and Söderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1, 3.
Received 13 May 1991; accepted 15 May 1992.
NOTE
Communicated by Terrence J. Sejnowski
Asymmetric Parallel Boltzmann Machines are Belief Networks

Radford M. Neal
Department of Computer Science, University of Toronto, 10 King's College Road, Toronto, Canada M5S 1A4
A recent paper in this journal (Apolloni and de Falco 1991) presents a learning algorithm for "asymmetric parallel Boltzmann machines." The networks described there can be viewed as particular forms of the "belief networks" (also known as "Bayesian networks") that are discussed in Pearl (1988). I have investigated connectionist learning rules for two types of belief network (Neal 1990, 1992), finding a learning rule for networks with sigmoidal units that is mathematically equivalent to that of Apolloni and de Falco. However, my computational implementation of this rule is different.

The networks of Apolloni and de Falco consist, when unfolded in time, of some number of layers of 0/1-valued units, each with weighted connections from the units of the preceding layer. Given some probability distribution for units in the first layer, a probability distribution for the subsequent layers is defined by letting the conditional probability that unit $i$ in layer $\ell$ takes the value 1, given the values in the preceding layers, be $\sigma(\sum_j w_{ij} s_j^{\ell-1})$, where $s_j^{\ell-1}$ is the value of unit $j$ in layer $\ell - 1$, $w_{ij}$ is the weight from that unit to unit $i$, and $\sigma(x) = 1/(1 + e^{-x})$. The learning problem is to adjust the network weights so as to make the distribution so defined for the last, "output," layer match that observed in the environment. There is also a version that learns a conditional distribution for an "output" given an "input."

A "belief network" also defines a probability distribution over units, expressed as the product of the conditional probabilities for each unit given the values of units that precede it in some ordering. Many types of belief networks are possible, differing in how these conditional probabilities are expressed. The networks of Apolloni and de Falco are belief networks in which the units have values of 0 and 1 and conditional probabilities are defined in terms of a weighted sum from preceding units, using the sigmoid function. Note that when viewed as belief networks, the layer structure is seen to be inessential. It is necessary only that the units be ordered in some way such that the connections all go forward; output units can be situated anywhere.
Asymmetric I’arallel Bol t/m,inn M,ichiiies
Figure 1: A fragment of
‘1
833
btBlief network for medical diagnmis
Mixture models and hidden Markov models can also be regarded as belief networks. In these cases, the networks have a tree structure that permits particularly efficient computation. A fragment of a hypothetical belief network that might be used for medical diagnosis is shown in Figure 1. The arrows into a unit indicate which other units are relevant in specifying its conditional probability distribution. If these probabilities are specified in terms of a weighted sum of inputs, each arrow will have a weight associated with it. In manually constructed belief networks such as this, it is typical for the arrows to also correspond to the direction in which causal influences are thought to operate, but this is not necessary if the objective is simply to represent probabilities. Weights for such a network can be learned as follows. For each training case, a sample is taken from the distribution over all units conditional on the output units having their observed values. For each state thus sampled, the weights are updated by amounts
$$\Delta w_{ij} \propto s_j \left[ s_i - \sigma\!\left( \sum_k w_{ik}\, s_k \right) \right]$$
Apolloni and de Falco sample states compatible with the known outputs by simply generating states from the unconditional distribution and discarding those that fail to match the output. If the number of outputs is at all large, this procedure is extremely slow. The stochastic simulation procedure now known as "Gibbs sampling" is generally preferable (Pearl 1988, section 4.4.3). In this method, of which the standard Boltzmann machine simulation procedure is a particular case, one clamps the values of observed units to their known values, sets the unclamped units to arbitrary initial values, and then repeatedly chooses a new value for each unclamped unit from its conditional distribution given the current values of all other units. This results in a stationary distribution equal to the conditional distribution given the observational data. I have successfully used this procedure to learn sigmoidal belief networks, as well as belief networks of the "noisy-OR" type
(Neal 1990, 1992). The method is similar to Boltzmann machine learning, but without the ”negative phase.” Lack of a negative phase allows learning to proceed significantly faster than in a Boltzmann machine. I believe that connectionist forms of belief networks are a promising way to attack unsupervised learning problems, at which backpropagation does not excel. They may also be useful in integrating empirical knowledge with knowledge derived from human experts, representation of which was the original motivation for work on belief networks.
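To make the procedure concrete, here is a minimal sketch of Gibbs sampling and the weight update for a sigmoidal belief network; the weight-matrix representation (with $w_{ij}$ nonzero only for $j < i$), the clamping interface, and the learning rate are assumptions of this sketch, not Neal's actual implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(s, W, clamped, rng):
    """One Gibbs sweep over the unclamped units of a sigmoidal belief network.
    W[i, j] is the weight from unit j to unit i, nonzero only for j < i, so
    all connections go forward in the ordering; `clamped` is a boolean mask
    marking observed units, whose values in s are held fixed."""
    for i in np.where(~clamped)[0]:
        logit = W[i] @ s                      # prior log-odds from i's parents
        for c in np.nonzero(W[:, i])[0]:      # likelihood terms from children
            a1 = W[c] @ s - W[c, i] * s[i] + W[c, i]   # child's input if s_i = 1
            a0 = a1 - W[c, i]                          # child's input if s_i = 0
            p1, p0 = sigmoid(a1), sigmoid(a0)
            logit += np.log(p1 / p0) if s[c] else np.log((1 - p1) / (1 - p0))
        s[i] = 1 if rng.random() < sigmoid(logit) else 0
    return s

def update_weights(W, s, lr=0.05):
    """One gradient step per sampled state: Delta w_ij is proportional to
    s_j (s_i - sigma(sum_k w_ik s_k))."""
    for i in range(1, len(s)):
        p = sigmoid(W[i] @ s)
        W[i, :i] += lr * (s[i] - p) * s[:i]
    return W
```

After a burn-in period, successive states produced by such sweeps approximate samples from the conditional distribution given the clamped outputs, and each sampled state can be used for one weight update, as described in the text.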
References

Apolloni, B., and de Falco, D. 1991. Learning by asymmetric parallel Boltzmann machines. Neural Comp. 3, 402-408.
Neal, R. M. 1990. Learning stochastic feedforward networks. Tech. Rep. CRG-TR-90-7, University of Toronto, Department of Computer Science, Connectionist Research Group.
Neal, R. M. 1992. Connectionist learning of belief networks. Artificial Intelligence 56, 71-113.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Received 23 March 1992; accepted 14 April 1992
NOTE
Communicated by Richard Durbin
Factoring Networks by a Statistical Method

Nelson Morgan
International Computer Science Institute, Berkeley, CA 94704 USA

Hervé Bourlard
International Computer Science Institute, Berkeley, CA 94704 USA, and Lernout & Hauspie Speechproducts, Ieper, B-8900, Belgium
We show that it is possible to factor a multilayered classification network with a large output layer into a number of smaller networks, where the product of the sizes of the output layers equals the size of the original output layer. No assumptions of statistical independence are required.

1 Introduction
Both on theoretical and practical grounds, it is generally preferable to reduce the number of parameters of a trainable classifier system. In particular, it would be desirable to factor a large multilayer perceptron (MLP) trained in classification mode (only one output with a nonzero target) into two or more smaller ones, so that the number of connections can be reduced. In some of these cases, the MLP has an extremely large number of output units, for instance representing a correspondingly large number of pattern types or classes that the net will be trained to recognize. For this case, incorporating a probabilistic interpretation permits a simple factorization of the networks that greatly reduces the number of parameters. No statistical assumptions, such as independence of any of the inputs or outputs, are required. We describe this method, and show an efficient implementation of the resulting networks.

2 MLPs to Estimate Framewise Probabilities
Earlier work has shown that the outputs of a multilayer perceptron (MLP) trained in classification mode with a least-mean-square or relative entropy criterion can be interpreted as posterior probabilities of each class given the input (Bourlard and Wellekens 1990). In other words, the MLP can estimate the probability $p(q_k \mid x_n)$, where $q_k$ is a pattern class $\in Q = \{q_1, \ldots, q_K\}$, the set of all classes for the task (e.g., phonemes
for speech applications), and $x_n$ is the input data for pattern $n$. If there are $K$ classes, then $K$ outputs are required in the MLP. For classification problems with many classes, the corresponding nets would have a large number of connections. Let coarse categories $r_j$ and $s_l$ be chosen to have a unique correspondence with each class $q_k$. For instance, $r_j$ and $s_l$ could be the row and column of a matrix of classes $q_k$. Then,
$$p(q_k \mid x_n) = p(r_j, s_l \mid x_n) \tag{2.1}$$
If $J$ is the number of $r_j$ categories and $L$ is the number of $s_l$ categories, then $K = J \times L$. If we use the definition of conditional probability, without any simplifying assumptions the previous expression can be broken down as follows:
$$p(r_j, s_l \mid x_n) = p(s_l \mid x_n)\; p(r_j \mid s_l, x_n) \tag{2.2}$$
Thus, the desired probability is the product of one coarse category posterior probability and a second conditional probability. The former can be realized with a standard MLP probability estimator, using the same inputs, and with $L$ output units. Viewing an MLP as an estimator of the left side of a conditional given the right side as input, the second term of equation 2.2 can be estimated by an MLP trained to generate the correct class $r_j$ given inputs of class $s_l$ (for instance using one-of-$L$ coding, that is, with $L$ inputs only one of which is on) and the input $x_n$. The latter network has $J$ outputs. This procedure reduces the training of a single network with $J \times L$ outputs to the training of two smaller networks with $J$ and $L$ outputs, respectively, and represents a generic way of splitting large MLPs used in classification mode into two smaller ones. A similar analysis can easily show how the network could be split into three or more smaller ones. If $(J + L) < (J \times L)$, the connections to the network outputs are reduced by this procedure. While reducing the number of parameters, this procedure has the potential, however, of requiring much greater computation during the recognition phase. Indeed, if one implements this method naively, the second network must be computed $L$ times for each pattern during recognition, since the output probabilities depend on an assumption of the coarse class $s_l$ (and thus must be evaluated for each such possible hypothesized class). However, this expense can largely be circumvented. Indeed, a simple restriction on the network topology permits the precalculation of contributions from the hypothesized coarse class $s_l$ to the output; this computation can be done at the end of the training phase, prior to the recognition of any patterns. By simply partitioning the net so that no hidden unit receives input from both $s_l$ input units and data input ($x_n$) units, we can precalculate the contribution to each output unit (prior to any output nonlinearity) for the $L$ possible choices of $s_l$, and form a $(J \times L)$-table of these contributions. During recognition, the presigmoid output values resulting from data vectors can be computed by a single
standard forward pass on the net for each pattern. For each hypothetical pattern category, these contributions from the data inputs can be added to the corresponding context contributions from the table. The major new computation (in comparison with a simple MLP) is then simply the cost of some lookups, particularly for the $s_l$ contributions. We are currently applying this approach to speaker-independent continuous speech recognition, where it is being used to estimate context-dependent phonetic classes, which could number in the hundreds of thousands (Morgan et al. 1991).
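The following sketch contrasts the naive factored computation with the table trick for the topologically restricted net; `coarse_net`, `context_net`, and `data_net` are hypothetical callables standing in for the trained MLPs, not the authors' code:

```python
import numpy as np

def naive_posteriors(x, coarse_net, context_net, J, L):
    """p(q_k | x) = p(s_l | x) * p(r_j | s_l, x): requires L forward passes
    of the second (contextual) network per pattern."""
    p_s = coarse_net(x)                         # length-L: p(s_l | x)
    p_q = np.empty((J, L))
    for l in range(L):
        p_q[:, l] = context_net(x, l) * p_s[l]  # one pass per coarse class
    return p_q.ravel()                          # K = J * L class posteriors

def restricted_posteriors(x, coarse_net, data_net, s_table):
    """Topologically restricted version: no hidden unit mixes s_l input with
    data input, so the s_l contribution to each presigmoid output is a fixed
    (J, L) table precomputed once after training.  Only a single forward
    pass through data_net is then needed per pattern."""
    a = data_net(x)[:, None] + s_table          # (J, L) presigmoid outputs
    p_r = 1.0 / (1.0 + np.exp(-a))              # output nonlinearity (sigmoid)
    p_r /= p_r.sum(axis=0, keepdims=True)       # renormalize over r_j per s_l
    return (p_r * coarse_net(x)[None, :]).ravel()
```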
3 Discussion
3.1 The Unrestricted Split Net. When splitting the original big net with $J \times L$ output units into two smaller networks with $J$ and $L$ outputs, respectively, the number of parameters is drastically reduced, which could affect the quality of the conditional distributions' estimation. However, parameter reduction is exactly the aim of the proposed approach, both to reduce computation and to improve generalization. As was done for $p(s_l \mid x_n)$ (Bourlard and Wellekens 1990; Morgan et al. 1991), it will be necessary to find [e.g., by using cross-validation techniques (Morgan and Bourlard 1990)] the number of hidden units (and hence the number of parameters) leading to the best estimate of $p(r_j \mid s_l, x_n)$. The desired probabilities can in principle be estimated without any statistical assumptions (e.g., independence). Of course, this is guaranteed only if the training does not get stuck in a local minimum and if there are enough parameters. However, the practical capability of such neural networks to estimate conditional probabilities has already been shown several times (e.g., in the above references), where $x_n$ in $p(s_l \mid x_n)$ was a vector or a sequence of correlated vectors.
3.2 The Topologically Restricted Net. As shown in the previous section, while reducing the number of parameters, the splitting of the network into two smaller networks results in much greater computation in the contextual network. To avoid this problem it was proposed to restrict the topology of the second network so that no hidden unit shares input from both $s_l$ and $x_n$. Consequently, the $s_l$ input only changes the output thresholds. However, a recent experiment with frame classification for continuous speech (trained using 160,000 patterns from 500 sentences uttered by a speaker in the Resource Management continuous speech recognition corpus) suggested that this does not affect the correct estimation of $p(r_j \mid s_l, x_n)$. In this example, the network with a split hidden layer predicted (for a test set of 32,000 patterns from 100 sentences) the correct right context 63.6% of the time, while a network with a unified hidden layer predicted the context 63.5% of the time, an equivalent figure.
Acknowledgments

We thank Chuck Wooters and Steve Renals for running the first tests of this new method, and our anonymous reviewer for sparking internal discussion about the representation capabilities of our networks.
References

Bourlard, H., and Wellekens, C. J. 1990. Links between Markov models and multilayer perceptrons. IEEE Trans. Pattern Anal. Machine Intelligence 12(12), 1167-1178.
Morgan, N., and Bourlard, H. 1990. Generalization and parameter estimation in feedforward nets: Some experiments. In Advances in Neural Information Processing Systems 2, D. Touretzky, ed., pp. 630-637. Morgan Kaufmann, San Mateo, CA.
Morgan, N., Bourlard, H., Wooters, C., Kohn, P., and Cohen, M. 1991. Phonetic context in hybrid HMM/MLP continuous speech recognition. Proc. Eurospeech'91, Genova, 109-112.
Received 6 August 1991; accepted 10 December 1991.
Communicated by Steve P. Luttrell
Maximum Entropy and Learning Theory

Griff L. Bilbro
David E. Van den Bout
North Carolina State University, Box 7914, Raleigh, NC 27695-7914
We derive the learning theory recently reported by Tishby, Levin, and Solla (TLS) directly from the principle of maximum entropy instead of statistical mechanics. The theory generally applies to any problem of modeling data. We analyze an elementary example for which we find the predictions consistent with intuition and conventional statistical results, and we numerically examine the more realistic problem of training a competitive net to learn a one-dimensional probability density from samples. The TLS theory is useful for predicting average training behavior.

1 Introduction
A statistical theory that describes the learning of a relation from examples was reported in Tishby et al. (1989). It built on earlier work in Schwartz et al. (1990) and has been carefully restated in Levin et al. (1990). In that literature, statistical mechanics was used to relate the probability of independent input-output data pairs to a layered neural network. In this report we will show that the TLS theory can be understood without statistical mechanics, that the form of the data need not be limited to input-output pairs, and that the restriction to layered neural networks can be relaxed to include any model with adjustable parameters.

2 Maximum Entropy and Modeling
In this section we will show that a TLS theory can be constructed for any problem in which parameters of a model are chosen to fit data. Consider a problem in which data $\{x_i\}_{i=1}^{m}$ drawn from an unknown density $p(x)$ are to be fitted with some model by adjusting parameters $w$ to minimize an additive error function during a training procedure. In the case of training a layered feedforward net, the data are input-output pairs; the parameters $w$ are the usual weights and biases; the error function and training procedure might be squared error and backpropagation. However, the theory is much more general. In later sections, we will apply
it to the simplest case of linear regression and to the case of learning a one-dimensional probability density. Except for the case of linear regression, there is little theoretical guidance for identifying when data are sufficient to determine the model. Often the total set of available data $\{x_i\}_{i=1}^{n}$ is divided into a training set $\{x_i\}_{i=1}^{m}$ and a remaining test set. The model is trained to a specified error $\epsilon_T$ on $\{x_i\}_{i=1}^{m}$ and then tested against the remaining data. In principle this procedure could be repeated for randomly selected test sets and initial values, and the average results could be tabulated as functions of $m$ and $\epsilon_T$. Implicit in this hypothetical exercise is an average density $\langle \hat{p}^{(m)}(w) \rangle$ of nets trained to an error $\epsilon_T$ on $m$ examples. The true $\langle \hat{p}^{(m)}(w) \rangle$ would be difficult to calculate in general, mostly because it involves the details of the training procedure. However, the end result of training may depend more on the form of the model, the amount of the data, the noise in the data, and the training error $\epsilon_T$ than on the details of the training procedure. In that case, the maximum entropy density $\langle p^{(m)}(w) \rangle$ resembles $\langle \hat{p}^{(m)}(w) \rangle$ and might be used to predict training behavior and generalization. The principle of maximum entropy is a general inference tool that produces probabilities characterized by certain average values of specified functions (Jaynes 1979). It is a generalization of the development of statistical mechanics and thermal physics, which restrict their averages to physical quantities such as energy, number of particles, or volume. To the extent that entropy measures information, a maximum entropy estimate contains only the information implied by those average values and makes no other assumptions (e.g., about the training procedure). It is useful to also incorporate a prior estimate $p^{(0)}(w)$ of $\hat{p}^{(m)}(w)$ by considering the entropy of $p^{(m)}(w)$ relative to $p^{(0)}(w)$,
$$H = -\int dw\; p^{(m)}(w) \ln \frac{p^{(m)}(w)}{p^{(0)}(w)} \tag{2.1}$$
In our practice so little is known about $\hat{p}^{(m)}(w)$ that $p^{(0)}(w)$ is chosen merely as a restriction to reasonable portions of $w$ space. For example, in backpropagation it is unreasonable to expect weights as large as $O(10^6)$, and perhaps $O(1)$ is a little too small. In any case the first test of this theory must be for sensitivity to $p^{(0)}(w)$. In the maximum entropy sense, the density $p^{(m)}(w)$ that contains the least information beyond $p^{(0)}(w)$, but is nevertheless a normalized density with integral
$$\int dw\; p^{(m)}(w) = 1 \tag{2.2}$$
and with specific average training error
$$\int dw\; p^{(m)}(w) \sum_{i=1}^{m} \epsilon(x_i, w) = m\, \epsilon_T \tag{2.3}$$
can be found with the calculus of variations and two Lagrange multipliers $\alpha$ and $\beta$ in the usual way. The extremum satisfies
$$\frac{\delta}{\delta p^{(m)}(w)} \left[ H - \alpha \int dw\; p^{(m)}(w) - \beta \int dw\; p^{(m)}(w) \sum_{i=1}^{m} \epsilon(x_i, w) \right] = 0 \tag{2.4}$$
which can be solved for $p^{(m)}(w)$ to get
$$p^{(m)}(w) = \frac{1}{Z^{(m)}}\, p^{(0)}(w) \exp\left[ -\beta \sum_{i=1}^{m} \epsilon(x_i, w) \right] \tag{2.5}$$
where $\beta$ has been left, but $\exp(1 - \alpha)$ has been evaluated in terms of the more conventional normalization
$$Z^{(m)} = \int dw\; p^{(0)}(w) \exp\left[ -\beta \sum_{i=1}^{m} \epsilon(x_i, w) \right] \tag{2.6}$$
which depends on the particular set of examples as well as their number. Equations 2.5 and 2.6 are related to the classical statistical mechanics of Maxwell-Boltzmann particles. Equation 2.6 is the partition function that normalizes equation 2.5, which becomes the familiar Boltzmann probability $Z^{-1} \exp(-\epsilon/T)$ when restricted to a single particle ($m = 1$), with no internal structure ($\epsilon$ independent of $x$), in thermal equilibrium at temperature $T = 1/\beta$ in a uniform density of states. Equilibrium statistical mechanics derives directly from the principle of maximum entropy: the Boltzmann distribution maximizes the entropy of particles constrained to have a specified average energy (with corresponding Lagrange multiplier $1/T$, or inverse temperature). When the system in question can exchange particles as well as energy with its environment, but has a specified average number of particles, the resulting maximum entropy distribution involves the Lagrange multiplier $\mu$, or chemical potential. When the volume of the system fluctuates around a specified average volume, a similar argument results in the definition of an associated Lagrange multiplier $p$, or pressure. Equation 2.5 is also significant for learning theory. It is an estimate of the probability density of models of the given form (i.e., architecture, for a neural net), defined by the functional relation between $w$ and $x$ in $\epsilon(x, w)$, after being trained from random initial values of $w$ to an average error (equation 2.3) on the particular data set $\{x_i\}_{i=1}^{m}$. However, equation 2.5 is still not useful as a predictor of average training behavior, because it does depend on the particular examples in the set $\{x_i\}_{i=1}^{m}$. To remove that effect, we average equation 2.5 over all possible $m$ examples,
$$\langle p^{(m)}(w) \rangle_x = \int dx_1 \cdots dx_m\; p(x_1) \cdots p(x_m)\; p^{(m)}(w) \tag{2.7}$$
Griff L. Bilbro and David E. Van den Bout
842
Equation 2.7 can be used to define a performance criterion, the average
prediction fraction
dtn)= J dw J dxm+lp(xm+l)eXp[-df(Xlll+l.w)I (~('~)(w))~ (2.8) which is the fraction of models distributed according to ~b"')(w)that has an error function c within about 1//j of the next unseen example x , , ~ +on ~ average. Equations 2.7 and 2.8 are inconvenient to evaluate exactly because of the Z('") term of equation 2.6 in their denominators. TLS propose an "annealed approximation," which in the present context is equivalent to replacing Z(In)in equation 2.7 by its average over ways of choosing
{XJE (Z(m))x
= Jdw
J dx("')p(x1) p ( x z ). . . p ( X I N ) P ( o ) ( W )
IC I !=in
~ ( x w) ,.
x exp -/j
(2.9)
r=l
which can be written ( Z ( m ) ) * = /dwp'O'(w)f"'(w)
(2.10)
where (2.11) With this, the average prediction fraction of equation 2.8 becomes (2.12) Equation 2.12 predicts generalization behavior by drawing a maximum entropy inference of the average consistency between the model represented by E ( X , w)and m examples drawn from p(x). We will show that equation 2.12 is well suited for theoretical analysis and is also convenient in practical numerical calculations for small problems. This is because it is easy to produce Monte Carlo estimates for the averages over the { x l } ~ ~byT using the entire set of available data {xl}:=f". 2.1 Relation to TLS. In this subsection we will show that under the assumptions of Tishby, Levin, and Solla, our average predition fraction is proportional to their average prediction probability (APE') ((p'")))
4m)= z ( P ) ( ( p ( " ' ) )
(2.13)
Maximum Entropy and Learning Theory
843
where
z(1l) = /dxexp[-,ir(x.zc,)j
(2.14)
normalizes the probability density for the conditional probability I
p(x I w)s -exp(-h(x. zu)] (2.15) z(Y) which TLS use to describe the behavior of a single certain net w. Equation 2.15 can itself be obtained from a maximum entropy argument in x space but we will not do so here. Now z is a function of ,j but is assumed in T L S to be independent of w. TLS show this to be rigorously true for layered nets with real outputs if f is the usual squared error between data and output. It is true in that case because the area under a gaussian is independent of the mean of the gaussian. Our derivation here is more direct than existing derivations of TLS, who apply Bayes' rule to statistical mechanics. In their treatment, the extra factor of z appears naturally. We can demonstrate the equivalence of equation 2.13 as follows. We solve equation 2.15 for the exponential and substitute it into equation 2.11 f ( w ) = p x m z ( B W I zu)
= z(P)g(w)
(2.16)
where g(w) was defined as "the average generalization of the network in Tishby et al. (1989)
g(w) = /dxP(x)p(x I w )
(2.17)
In a later treatment, g(w)was described as "the sample average of the likelihood" of the network (Levin et al. 1990). It measures how well a certain model w can be expected to explain an example drawn from F7(x). This measure is positive and can be used to compare the performance of two models, but is not generally bounded by unity for real-valued data. We now substitute equation 2.16 into equation 2.12 to get (2.18) which yields equation 2.13 since Levin, Tishby, and Solla define APP as (2.19) where we have used y to distinguish their varable g from the function g(w). They define the denominator of this expression as
Griff L. Bilbro and David E. Van den Bout
844
which can be evaluated with the Dirac delta function to get (2.21)
with a similar expression for the numerator, so that (2.22)
Equation 2.13 follows immediately from equations 2.22 and 2.12. Except for a scale factor z ( B ) ,our average prediction fraction is therefore identical to the TLS average prediction probability under the conditions that TLS assume. Under their restrictions, we find our c,hctn) equivalent to the TLS ((p(’”)))since we are concerned with the dependence of APP on m and z ( P ) is independent of the number of examples m. However we do not require the condition that z(P) is independent of the model w that TLS require for the Bayesian part of their argument (Tishby et al. 1989; Schwartz et al. 1990). However, for consistency with the existing work of Tishby, Levin, and Solla, we will use their ( ( p ( I n ) ) )in the remainder of this report. 3 Analysis of an Elementary Example
In this section we theoretically analyze the problem of estimating a constant from noisy measurements. The utility of this elementary example is that it admits an analytic solution for the APP that can be compared with conventional analysis. All the relevant integrals can be computed with the identity 00
.I,dx exp [-al(x
- b1)2 - a2(x - b2)2]
We take the true value of the constant to be W and assume the noise to be zero mean additive gaussian of variance 1/2ru, so that the density of a measurement with value x is (3.2)
We choose a simple prior density as a gaussian with mean wo and variance 1/2r (3.3)
We choose the simplest error function F(X
I w)= (x - w)2
(3.4)
Maximum Entropy and Learning Theory
845
the squared error between a sample x and the estimate TO. According to TLS we now analyze the problem. We substitute the error function into equation 2.15 to get (3.5)
m,
with z( ,j)= g by j solving
which is independent of
/dxp(x I W ) f ( X . W )
ZO
as assumed. We determine (3.6)
= fT
to get lj =
1
(3.7)
-
2FT
The generalization, equation 2.17, can now be evaluated with equation 3.1 (3.8)
where (3.9) is less than either
(v
or /I. The denominator of equation 2.22 becomes (3.10)
r+mti with a similar expression for the numerator.
3.1 Large Training Set Size. The case of many examples or little prior knowledge is interesting. Consider equations 2.22 and 3.10 for mti >> r
-
(3.11)
which climbs to an asymptotic value of ,/ti/. for m m. To compare this with intuition, consider that the sample mean of {XI.x2.. . . . x,,,} approaches to within a variance of 1/2mrr, so that (3.12) which makes equation 2.22 agree with equation 3.11 for large enough 11. In this sense, the statistical mechanical theory of learning differs from conventional Bayesian estimation only in its choice of an unconventional performance criterion APP. 3.2 Small Training Set Size. We can demonstrate overtraining even for this elementary example when the size of the training set m is small and the the error on the training set f T is reduced too much. It will be
Griff L. Bilbro and David E. Van den Bout
846
sufficient to consider equation 2.12 for r n << ~ r so that (3.13) by which equation 2.22 becomes (3.14) This expression for APP depends on K , which in turn depends on p, which is finally determined by the training target error E T . As a function of K , equation 3.14 exhibits a maximum at &opt =
1 2(wo - q
(3.15)
2
Using equations 3.7 and 3.9, we find that the system will have the highest average prediction probability if training is stopped when the error on the training set is (3.16) 2a which depends only on the variance of the target distribution 1/2a and the difference between the mean of the prior distribution wo and the mean of the target distribution g. Since ( E ) and ET cannot be negative, this optimum is physical only when it is positive. So when equation 3.16 is negative, training should continue to the smallest possible error, since in that case there is no danger of overtraining. Now E T , > ~ ~0 ~occurs only when the error of the prior estimate for the unknown constant is larger than the width of the prior distribution. Furthermore it can occur only when m is not too large, since we have assumed mtc << r. We can therefore translate equation 3.16 into the following rule for this simplest of systems: overtraining occurs when the system is trained to too low an error (i.e., ET < E T , ~ on ~ ~a ) training set that is too small (i.e., rn << r d to compensate for an initial overconfidence in the prior estimate [i.e., 1/2a > (wo- z)21. ET,opt = (w0 - g )2
-1
4 General Numerical Procedure
In this section we show how to numerically apply the theory. We can estimate the moments of equation 2.22 by the following Monte Carlo procedure. Given a data set {x;}jzy drawn from the unknown density p on domain X with finite volume V, an error function E ( X I w),a training error E T , and a prior density p(')(w) of vectors such that each w specifies a candidate model: 1. Construct two sample sets: a prior set of Np functions {wp}drawn from p(O)(w)and a set of Nu input vectors {x,,} drawn uniformly from X. For
847
Maximum Entropy and Learning Theory
each y in the prior set and every I I in the uniform set, tabulate the error fllJ, = f(x,, I a l l , ) . For each i in the training set and every I f in the uniform set tabulate c,;, = f(x, 1 7 ~ ' ~ ' ) . 2. Determine the sensitivity , 1 for a specified
fT
by solving (4.1)
Equation 4.1 can be shown to be monotonic in one-dimensional search
3. Estimate the average generalization of a given
, j so
7oI,
this involves a simple
from equation 2 17 (4.2)
The factor of V remains in this particular expression because each probability density is normalized to have u n i t integral over X so that the Monte Carlo expression for integral of their product retains one factor of V , the integral of X itself. 4. The performance after 111 examples is the ratio of equation 2.22. By construction the wJ, are drawn from p'")(70) so that
(4.3)
5 Learning a Density
~~~
- -.
A TLS theory is completely specified by a dataset {x,};=y [or the corresponding density p(x) in pattern or data space], a prior density , , ' " ' ( 7 ~ ) in 7 u space, and an error function c(x.UI ) that relates xs to i"S. In this sense, learning a density is no different from learning a map. The error function c(s. zu) measures some important difference between data and model and for learning a density, the negative of the log likelihood is a natural choice. Competitive learning nets (CLNs) of the form shown schematically in Figure 1 can be trained with the algorithm of Figure 2 to "learn" the density from which a set of training examples was drawn (Rumelhart and McClelland 1986). We consider CLNs because they are familiar and useful (Melton etal. 1992), because there exist two widely known training strategies for CLNs [the ncurons can learn either independently or under a global interaction called conscience (DeSieno 1988)], and because CLNs can be applied to one-dimensional problems without being too trivial. Asymptotic behavior of these nets depends on the error function F(x. 7 ~ in a known and useful way: the asymptotic density of neurons can be algebraically related to the target density used for training (Ritter 1991).
)
Griff L. Bilbro and David E. Van den Bout
848
i
0.
.1.. *.
02
0
O O .
0
0 "oo"o"
0,"o . 0 0 0 "oo"o"
Figure 1: A two-dimensional competitivelearning net. Top: Evolution of neuron locations toward samples. Bottom: Equilibrium neuron locations in regions of high sample density. Competitive learning nets with conscience qualitatively change their behavior when they are trained on finite sample sets containing fewer examples than neurons; except for that regime we found the theory satisfactory. A CLN with k neurons is specified by a vector w of k neuron locations. We restricted ourselves to a one-dimensional problem on the unit interval. All experiments we will present in this section were conducted on data drawn from the following one-dimensional training density:
0
otherwise
The simplest choice for the prior density p(')(w)of CLNs is a uniform density in a k-dimensional space. This was our initial choice; however, we eventually spent a large part of our time testing the sensitivity of
Maximum Entropy and Learning Theory
849
/* iiiitinlizr Iieiirons with random weights */ for(iel;i
+ IT’,
- ZlJ,,I
/ * f h l the clos~stnetiroii to the vector */ k t l for(i.=l;i
/* ndjrist wciglits of winning tieiiroti */ for( j .= 1;; < W; j t j + 1 ) zllk, e wk, + f ’ (11, - ZIJkl) /* rdiicc the lenrnitig rate */ e (I+ f
f
’
Figure 2: Algorithm for training a CLN. predictions to a variety of prior densities. We chose the error function l=h
f ( x .w)= min ( x- X J , ~ ,=1
(5.1)
which is the distance between the sample x and location of the nearest neuron zu,. The error of several samples is the sum of the separate errors. If the samples are assumed to be independently drawn so that the joint probability of a set of samples is the product of the component probabilities, it can be shown that f is proportional to the log of the conditional probability of observing x given the CLN zu. Our chosen form
Griff L. Bilbro and David E. Van den Bout
850
2
2
.
1.5
'I0
013
-
n
a
019
02,
-
li$
0,s 017
g
019
a
1 -
07
0.7
Figure 3: Predicted APP versus number of training samples for a 20-neuron competitive network trained to various target errors where the neuron weights were initialized from (a) a uniform density and (b) an antisymmetricallyskewed density. for F therefore corresponds to an exponential density around each neuron
p(x I
W l ) 0:
exp(-lx - zu,I)
(5.2)
for x close enough to the ith neuron. The neurons of a trained CLN tend to cluster in regions where the data density is highest. For rn >> k >> 1, the distribution of neurons can be represented by a density that approaches a power of the density of the data (Ritter 1991). In Figure 3 is the average prediction probability (APP) for k = 20 versus m, for several values of target error t T and for two prior densities; first consider predictions from the uniform prior. For f T = 0.01, APP practically attains its asymptote of 1.5 by m = 40 examples. Assuming the APP to be dominated in the limit by the largest g, we expect a CLN trained to an error of 0.01 on a set of 40 examples to perform 1.5 times better than an untrained net on unseen samples drawn from the same probability density. We can therefore define a predicted probable error of order
For k = 20, fproh = 0.017 for f T = 0.01 and fprob = 0.021 for FT = 0.02. We performed 5000 training trials of a 20-neuron CLN on randomly selected sets of 40 samples from the training density. Each network was trained to a target error in the range [0.005.0.03]on its 40 samples, and
Maximum Entropy and Learning Theory
851
-
0.04
r-0
0.01
0.02
0.03
Training Error
a
O 0.03
'
0
0
4
0.02 Training Error
0.01
K
0.03
b
Figure 4: Experimentally determined and predicted values of total error across the training density after competitive learning was performed using a 20-neuron network trained to various target errors (a) with 40 samples and (b) with 20 samples. the average error on the total density was then calculated for the trained network. Figure 4 is a plot of 500 of these trials along with the predicted errors for various target errors. The probable error is qualitatively correct and the scatter of actual experiments increases in width by about the ratio of APPs for m = 20 and tri = 40. For the case of m = 20 examples, the same net can only be expected to exhibit probable errors of .019 and ,023 for corresponding training target errors, which is compared graphically in Figure 4 with the experimentally determined errors for ttz = 20. The APP curves saturate at a value of iri that is insensitive to the prior density from which the nets are drawn. The vertical scale does depend somewhat on the prior however. Consider Figure 3, which also shows the APP curves for the same k = 20 net with the prior density antisymmetrically skewed 177uny from the true density by the following function: 1
U < d < l otherwise
For m > 20 the slznpes of the curves are almost unchanged, even though the vertical scale is different: saturation occurs at about the same value of in. Even when the prior greatly overrepresents poor nets, their effect on the prediction rapidly diminishes with training set size. This is important because in actual training, the effect of the initial configuration is also quickly lost. For ~n < 20 the predictions are not valid in any case, since
852
Griff L. Bilbro and David E. Van den Bout
our simple error function does not reflect the actual probability even approximately for in < k in these nets. It is for m < 20 where the only significant differences between the two families of curves occur. We have also been able to draw the same conclusions from less structured prior densities generated by assigning positive normalized random numbers to intervals of the domain. In fact, we were not able to produce curves worse than those of Figure 3. Moreover, we generally find that TLS predicts that about twice as many samples as neurons are needed to train competitive nets of other sizes. The previous curves were produced with large total sample sets N = 1000. We subsequently reran the experiments with N = 100 with essentially identical results (we d o not plot them because the two are indistinguishable). It is reassuring that the predictions of the theory are reliable even for relatively small sample sets.
6 Conclusion
We have derived the TLS theory of learning using the principle of maximum entropy instead of a combination of statistical mechanics and Bayes' rule. We show that TLS can be applied generally to learning and modeling. We apply it to learning a constant and to learning a one-dimensional density in the annealed approximation of TLS. We considered the effects of varying the number of examples m, the target training error CT (or equivalently P), and the choice of prior density o(')(w). These experiments on learning a density are consistent with learning a binary output (Bilbro and Snyder 19901, a ternary output (Chow et al. 19911, and a continuous output (Bilbro and Klenin 1990). We find if saturation occurs for m substantially less than the total number of available samples, say m < T / 2 , that in is a good predictor of sufficient training set size. Moreover there is evidence from a reformulation of the learning theory based on the grand canonical ensemble that also supports this statistical approach (Klenin 1990). For small problems the theory is very easy to use. We have no experience yet in applying the theory to large problems with more than 50 unknown weights or parameters. The TLS theory appears promising as a predictor of sufficient training set size; however, another question remains. It is difficult to obtain large overtraining effects from this theory even though some traces of overtraining remain: The curves of Figure 3 can be made to cross for small m and poor prior. In equation 3.16 of the elementary example, it is possible to obtain a positive F T , ~ However, ~ ~ . it appears that some aspects of overtraining do not survive the annealed approximation. This is consistent with the experience of other workers (Solla 1990).
Maximum Entropy and Learning Theory
853
References Bilbro, G. L., and Klenin, M. 1990. Thermodynamic models of learning: Applications. Unpublished. Bilbro, G. L., and Snyder, W. E. 1990. Learning theory, linear separability, and noisy data. CCSP-TR-90/7, Center for Communications and Signal I'rocessing, Box 7914, Raleigh, N C 27695-7914. Chow, M. Y., Bilbro, G. L., and Yee, S. 0. 1991. Application of learning theory to single-phase induction motor incipient fault detection artificial neural networks. Irit. J. N m r d S y s t . 2(1,2), 91-100. DeSieno, D. 1988. Adding a conscience to competitive learning. I n I E E E / r f k r riatiorin1 Corifcr~'~~cc O I I Nc~rrrnlNc?i[lorks, pp. 1:117-1:124. Jaynes, E. T. 1979. Where Do We Stand on Maximum Entropy? I n M n s i r r i ~ m Eritropy Forrnalisrrr, R. D. Leven and M. Tribus, eds., pp. 17-118. The MIT Press, Cambridge, MA. Klenin, M. 1990. Learning models and thermostatistics: A description of overtraining and generalization capacities. NETR-90/3, Center for Communications and Signal Processing, Neural Engineering Group, Box 7914, Raleigh, NC 27695-7914. Levin, E., Tishby, N., and Soll'i, S. A. 1990. A statistical approach to learning and generalization in layered neural networks. Proc. l E E E 78(10), pp. 1568-1 574. Melton, M., Plian, T., Reeves, U., and Van den Bout, D. E. 1992. The TInMANN VLSI Chip. I E E E Trms. N w r n l Nrtzvorks 3(3). Ritter, H. 1991. Asymptotic lcvel density for a class of vector quantization processes. I E E E Trmis. N w r d Nrtworks 2(1), pp. 173-175. Rumelhart, D., and McClelland, J. 1986. Pnrnllrl Distributed Proccssirig: Ex$ratioris ir7 the Micr~stritrtrrrc~ of Copition, Chapter 5. The MIT Press, Cnmbridge, MA. Schwartz, D. B., Solla, S. A., Samalan, V. K., and Denker J. S. 1990. Exhaustive learning. Ntwrd Cornp. 2(3), pp. 374-385. Solla, S. A. 1990. Private communication. Tishby, N., Levin, E., and Solla, S. A. 1989. Consistent inference of probabilities in layered networks: Predictions and generalization. IJCNN, pp. 11:403410. IEEE, New York. ~
Received 1 November 1991; accepted 25 March 1992
This article has been cited by: 2. David J. Miller , Lian Yan . 2000. Approximate Maximum Entropy Joint Feature Inference Consistent with Arbitrary Lower-Order Probability Constraints: Application to Statistical ClassificationApproximate Maximum Entropy Joint Feature Inference Consistent with Arbitrary Lower-Order Probability Constraints: Application to Statistical Classification. Neural Computation 12:9, 2175-2207. [Abstract] [PDF] [PDF Plus] 3. Lian Yan, D.J. Miller. 2000. General statistical inference for discrete and mixed spaces by an approximate application of the maximum entropy principle. IEEE Transactions on Neural Networks 11:3, 558-573. [CrossRef] 4. Shie-Jue Lee, Mu-Tune Jone. 1996. An extended procedure of constructing neural networks for supervised dichotomy. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 26:4, 660-665. [CrossRef]
Communicated by Les Valiant
Tight Bounds on Transition to Perfect Generalization in Perceptrons Yuh-Dauh Lyuu Igor Rivin NEC Research Institute, Princeton, N ] 08540 U S A "Sudden" transition to perfect generalization in binary perceptrons is investigated. Building on recent theoretical works of Gardner and Derrida (1989) and Baum and Lyuu (1991), we show the following: for a > a , = 1.44797. if (Y n examples are drawn from the uniform distribution on { + l , -1)" and classified according to a target perceptron w tE { + l , -l}"as positive if w t x 2 0 and negative otherwise, then the expected number of nontarget perceptrons consistent with the examples is 2-@(fi); the same number, however, grows exponentially 2@(") if a < a,. Numerical calculations for n up to 1,000,002 are reported.
. .,
-
1 Introduction
We are interested in the following problem of sample complexity for valid generalization. Let rn examples be drawn from a probability distribution D and classified as positive or negative according to some threshold net f. We would like to pinpoint the value m where the expected number of non-target nets in a class of threshold nets, R,that consistently classify those examples shifts from exponential to negligible. Recently, Sompolinsky et al. (1990) suggest a phenomenon when the target function and the hypothesis function are both simple perceptrons (Minsky and Papert 1969) with fl weights and no hidden units, and the distribution D is uniform over the vectors {il.-1}". Using less rigorous methods of statistical physics, they argue that for m > 1.24n, perfect generalization is achieved as n + m; that is, with high probability, the target function is the only one consistent with the data set. Such conclusion finds agreement in simulation. (A conclusion similar to theirs is reached by Gyorgyi (1990) using the annealed approximation method.) Baum and Lyuu (1991)prove their conclusion rigorously for tn > 2.0821 n by showin that, then, the expected number of consistent hypotheses is only 2-@(&, which is also an upper bound on the probability for perfect generalization not to hold. Related theoretical work has been done by Baum and Haussler (1989). It is shown that if one can find a choice of weights such that at least a fraction 1 - ~ / of 2 a set of rn > 32zo/c ln(32 n ) / f random examples are Neural Computation 4, 854-862 (1992)
@ 1992 Massachusetts Institute of Technology
Transition to Perfect Generalization in Perceptrons
855
correctly classified, then, independent of the distribution D, one has high confidence of having error probability less than f on future examples. Here, ZLJ is the number of weights in the threshold net (i.e., the nets in IH) and I I is the number of nodes. Experiments by Bauni (1990) indicate >> io >> 1, that if D is uniform over the interior of the unit cube, and then the error rate is in fact F = zo/m for multilayer perceptrons. The following is an outline of our result. Let zu, E { +l. -l}” be the target perceptron. We are given a set of o. I I examples, where each example is a vector x drawn uniformly from {+l.-1}“ and classified as positive or negative according to whether XJ, . x 0 or not. For ( r > ( i t 1.44797. . ., the expected number of lion-ill, perceptrons consistent with the examples is shown to be 2-“)‘\’‘I, which is also an upper bound on the probability that there exists one such perceptron; the same number, however, grows exponentially 2“’”) if ( I i( I ( . . This is done in Section 4. Such an ( i t has been reached by Gardner and Derrida (1989). But their proof is not entirely correct; worse, it leads to false conclusions (Remark 4.3). We note that our result improves the one given by Baum and Lyuu (1991). Numerical calculations for I i up to 1.000.002 are reported in Section 5. 2 Definitions and Terminology
We follow the definitions and terminology used by Baum and Lyuu (1991). The number of examples will be denoted by 171 = ( I ti. Without loss of generality, assume I i > 0 is even and the target perceptron is ZLJ, =
( + I .+ I . . . . .
1
Let P(i) denote the probability that a perceptron y E {+l.-l}” at Ham~ a random x E { + l . -1 }”. Define ming distance i from 7 1 ~misclassifies
and
1-1
As there are exactly
perceptrons at Hamming distance i from zi+, the target perceptron, z , denotes the expected number of consistent perceptrons that are at Hamming perdistance i from zi+. Clearly, A(n.( I ) is the expected number of non-ZL’, ceptrons y E {+l.-l}” that consistently classify 171 random examples.
Yuh-Dauh Lyuu and Igor Rivin
856
Let b(k;n. p ) be the probability that n Bernoulli trials with probabilities p for success and q = 1 -p for failure result in k successes and n-k failures. Denote the number of successes in n trials by S,,,. Hence b(k;n.p) = Pr(S,,,p= k ) . Define &,k = b ( n / 2 + k;n. 112) = Pr(S,,,1/2= n / 2 + k ) , which measures the probability that the result of n Bernoulli trials, S,,,,, deviates from the mean, n/2, by k. We let all.k = 0 for Ikl > n/2. Note that aii.-k
= an.k
and
Throughout this paper, p will be 1/2. Finally, h ( n ) N g ( n ) means h ( n ) / g ( n )+ 1 as n
-
m.
3 Preliminary and Some Known Results
We list known results used by our proof in the next section. The first result can be easily proved from Stirling’s formula (Bollobas 1985, p. 4). Fact 3.1 (Baum and Lyuu 1991, Lemma 3.5) For an even integer a,
The next result is an exact recurrence relation for the P(i)s. Since its proof is very involved, we refer the interested reader to Baum and Lyuu (1991). Fact 3.2 (Baum and Lyuu 1991, Theorem 3.8) For i even, we have
and
P(i) - P(i - 1) = 0 We will need good bounds on the tails of normal distribution and sum of binomial distributions. The following two facts will be used later. Fact 3.3 (Bollobas 1985, Chernoff bound)
Fact 3.4 (Bollobiis 1985, p. 9) As x
--t
03,
857
Transition to Perfect Generalization in Perceptrons
We also need the DeMoivre-Laplace Limit Theorem to approximate the probability that S,,1 l 2 lies within a certain range near its mean, 11/2, by normal distributions. Note its range of applicability as expressed in how 111 and 122 may grow as I I and that their difference must go to infinity as n -, x.
Fact 3.5 (Bollobas 1985, p. 13) Suppose h, < I I ~ , I I ? - hl l11lI + lhzl = o [ ( r ~ / 4 ) ~ / 'Let ] . / I , = x,(i1/4)"~.Tlzeri Pr(n/2
-
x as 11
- x ,nrid
- a. 1''
+ hl 5 S,, 1/2 5 12/2 + 1 1 ~ )
1
-
e-l2I2dt
11
The following simple lemma will also be used.
Lemma 3.6 For a i l / i 5 1112(i may depend O H i f ) , we Iznve I
all-,,^!, = 0 (1Vfi) /=(I
as n
--t
m.
Proof. For simplicity, let both
IZ
and i be even. Now,
which is the last equation above from Fact 3.1. Symmetrically, one can prove that
A
= 0(
1 / 6 )
Hence, A = O ( m i n ( l / f i . 1/4)], which clearly proves the lemma no matter how i may depend on 11. Q.E.D. 4 The Main Result
The main result of this paper, to be proved in this section, is the following.
Theorem 4.1
By a proof that is as involved as that of Fact 3.2, Baum and Lyuu (1991, pp. 394-395) show that 2 , = 2-R(r')for any ( I > 1 when i 2 ,212. Hence, we only have to consider those 2 , terms with 1 5 i 5 ~ / 2 .We shall assume n/2 is odd and i even without loss of generality. The probability that a perceptron y at Hamming distance z from ?1+ classifies m random examples is calculated as follows. Assume, without loss of generality, that y's first i components are -1 and the rest are +l.
858
Yuh-Dauh Lyuu and Igor Rivin
Let Z denote the sum of the first i components of the example x and Y that of the other iz - i components. Clearly, Y + Z and Y Z denote ZUr. x and y . x, respectively. For y to classify x as does, we must have: Y + Z 2 0 iff Y - Z 2 0 and Y + Z < 0 iff Y - Z < 0. Hence, the probability that y classifies rn random examples consistently is ~
~ ( i dlf ) 1 - P(i) = [Pr(O 5 5 Y) + Pr(O < -Z 5 Y) + Pr(O 5 Z < -Y)+ Pr(O < -Z < -Y)]"'
z
where Y is the sum of 11 - i random numbers fl,whereas Z is the sum of i random numbers +l. So, R(i) is equal to
J which, after some manipulations, lead to
[
((lJ+f
y=o
5+
?=-I/
q
-1
all-,
-
2
C
I/= ( l l - J ) / Z
all-, ,al
l,=-(l,-l)/2z=l/
1
From Lemma 3.6 we know that the second term within the brackets in the above formula is of order o(1). Hence we end up with
(4.1) Remark 4.2 We remark that Y and Z have limiting distributiom that are normal with expectation 0 and standard deviations 6 1 2 and 4 1 2 , respectively. First, for i = o ( n ) ,we have
(n - i ) / 2
2 N
-
T
fi
2" - 1
'J
1
Jic.-i,
Hence, P ( i ) < c for some constant c as n be o(n/ log3n ) , we have
+ m.
Further limiting i to
Transition to Perfect Generalization in Perceptrons
859
So we have shown that, for i = o(n/ln3 IZ), each z, is of size 2-"('") 2- 0. We proceed to bound each z, where i = 12( t z / In3 n ) . Observe that
Since the second term in 4.2 is o(1) as n substitute it with
as 2
& s:,fi~-'/' ~
(&
+
e-"i2ilz) dy
=
cx; by using Fact 3.3, we can
=
o ( 1 ) by Fact 3.4. The first
term of 4.2 is, by Fact 3.5,
The above two approximations combined with formula 4.1, we have
which is [Gardner and Derrida 1989, Eq. (2311 (4.4) where x = 1 - ( i / ~ ) . As in Gardner and Derrida (1989), we have
rv
e ~ ~ { -1 i ni-il-i)
lii(l-i)+rb
In[
$
tan-I(fi)+o(lj]}
(4.5)
One can easily show that, for ( t > 1.44797.. ., the exponent is less than zero for all x, whereas, for < 1.44797.. ., the exponent can be positive for some constant x [i.e., some i = @(ti)]. Thus we conclude our proof.
Yuh-Dauh Lyuu and Igor Rivin
860
Remark 4.3 Here we comment on the reasons why Gardner and Derrida's argument is not correct. Using the Central Limit Theorem, they jump directly from Remark 4.2 to the claim that 4.1 = 4.3 = 4.4 without any qualification on the size of i, i.e., without going through the steps that we took for the case i = o(n/ log3n ) . Another way of saying the same thing is they take the DeMoivre-Laplace Limit Theorem, Fact 3.5, and immediately apply it to 4.1 to obtain 4.3, without taking into account that theorem's range of applicability. Hence, their result does not apply when i is, for example, "small." In fact, their formula 4.5, used without qualifications, would imply that A(n,rr)= 2-'(") for any o > ac, which is false by Theorem 4.1. 5 Numerical Results and Conclusion
We use the Mathernatica 2.0 program in Figure 1 to calculate A(n,tr). Figure 2 contains the log base e plot of A(n.1.448) against n. Although Theorem 4.1 predicts that it should eventually approach -@(fi), this graph fits perfectly the linear function 0.800727 - 9.8459 x 1OP6n. The explanation is that, at N = 1.448, which is very close to rxc = 1.44797. . ., terms of size 2-'(") in A(n.1.448) may dominate those of size 2-'(fi) for relatively big n before they start to be dominated. In Figure 3, we pick N = 2, which is much farther from the critical value ac, plot
Nbinom[ n-, i-,
Nbinom2[ i-,
prec-:
prec-:
LogGamma[ N[ n + 1, prec ] ] . LogGamma[ N[ i + 1, prec ] ] LogGamma[ N[ n - i + 1, prec ] I ]
40 ] := Exp[
40 ] := Exp[ LogGamma[ N[ 2 i + 1, prec ] ] ( 2 LogGamma[ N[ i + 1, prec ] ] + ( 2 i ) Log[ " 2, prec 1 1 ) I
P[ n-, 0 1 = 0 P[ n-, i-] := P[ n, i ] = If[ EvenQ[ i ] , P[ n, i - 1 1, P[ n, i - 1 ] + Nbinom2[ ( i - 1 ) / 2 ] Nbinom2[ ( n - i ) / 2 ] ] A[ n-,
alpha-] := Block[ ( sum = 0,pi = 0 ), For [ i = 1, I C=
n,
i++, If[ OddQ[ i 1, pi += Nbinom2[ ( i - 1 ) / 2 ] Nbinom2[ ( n - i + 1) / 2 ] 1; term = Nbinom[ n, i ] ( i - pi )"( n alpha ); sum += term 1; sum ]
Figure 1: Mathematica program used to calculate A(n.a ) .
861
Transition to Perfect Generalization in Perceptrons
0 0 0
0
0
Figure 2: A plot of log,[A(rz.1.448)].
Figure 3: A plot of [log,A(rz.2.0)]*.
862
Yuh-Dauh Lyuu and Igor Rivin
[log, d ( n , 2.0)]*against n, and find it to be perfectly fit by the linear function -7241.51 2.46168n (drawn as a line), showing A ( n .2.0) grows at roughly ecm as the theorem says. In the other direction, we quickly get exponential blow-up when ( v is even just a little bit below oc. For example, d(ll000,002.1.447)E= 3.44762391 x We have not been able to prove rigorously the stronger claim (Gyorgyi 1990; Sompolinsky et al. 1990) that, for an (1' around 1.24, the probability that there exist nontarget perceptrons consistent with ( I n random examples, where N > o', is negligible. Our result puts n' < 1.44797. . . .
+
Acknowledgments The authors thank Eric B. Baum and David Jagerman for discussions. The authors also thank two anonymous referees whose suggestions substantially improved the presentation and accessibility of this paper.
References Baum, E. B. 1990. What can back propagation and k-nearest neighbor learn with feasible sized sets of examples? In Neurd Networks, E"AASIP Workshop 1990 Proceedings, L. B. Almeida and C. J. Wellekens, eds., pp. 2-25. Lecture Notes in Computer Science Series, Springer-Verlag, Berlin. Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neurul Comp. 1, 151-160. Baum, E. B., and Lyuu, Y.-D. 1991. The transition to perfect generalization in perceptrons. Neirral Comp. 3, 386-401. Bollobas, B. 1985. Random Graphs. Academic Press, New York. Gardner, E., and Derrida, B. 1989. Three unfinished works on the optimal storage capacity of networks. I. Phys. A: Math. Gen. 22, 1983-1994. Gyorgyi, G. 1990. First-order transition to perfect generalization in a neural network with binary synapses. Plzys. Rev. A 41, 7097-7100. Minsky, M., and Papert, S. 1969. Perceptrons: An lntrodirctioii to Cornputatioizal Geometry. The MIT Press, Cambridge, MA. Sompolinsky, H., Tishby, N., and Seung, H. 1990. Learning from examples in large neural networks. Phys. Rev. Lett. 65(13), 1683-1686. _
~
_
_
_
_
~
~
Received 14 November 1991; accepted 26 March 1992.
This article has been cited by: 2. Shao C. Fang, Santosh S. Venkatesh. 1999. Learning finite binary sequences from half-space data. Random Structures and Algorithms 14:4, 345-381. [CrossRef] 3. David Haussler, Michael Kearns, H. Sebastian Seung, Naftali Tishby. 1997. Rigorous learning curve bounds from statistical mechanics. Machine Learning 25:2-3, 195-236. [CrossRef]
Communicated by Harry Barrow
Learning Factorial Codes by Predictability Minimization Jurgen Schmidhuber Departmerit of Coriipirter Scicriic, Utiizlcrsity of Colorndo, Boirlder, CO 80309 U S A
I propose a novel general principle for unsupervised learning of distributed nonredundant internal representations of input patterns. The principle is based on two opposing forces. For each representational unit there is an adaptive predictor which tries to predict the unit from the remaining units. In turn, each unit tries to react to the environment such that it minimizes its predictability. This encourages each unit to filter “abstract concepts” out of the environmental input such that these concepts are statistically independent of those on which the other units focus. I discuss various simple yet potentially powerful implementations of the principle that aim at finding binary factorial codes (Barlow e t al. 1989), i.e., codes where the probability of the occurrence of a particular input is simply the product of the probabilities of the corresponding code symbols. Such codes are potentially relevant for (1) segmentation tasks, (2) speeding u p supervised learning, and (3) novelty detection. Methods for finding factorial codes automatically implement Occam’s razor for finding codes using a minimal number of units. Unlike previous methods the novel principle has a potential for removing not only linear but also nonlinear output redundancy. Illustrative experiments show that algorithms based on the principle of predictability minimization are practically feasible. The final part of this paper describes an entirely local algorithm that has a potential for learning unique representations of extended input sequences. 1 Introduction
Consider a perceptual system being exposed to an unknown environment. The system has some kind of internal ”state” to represent external events. We consider the general case where the state is an n-dimensional distributed representation yl’ (a vector of real-valued or binary code symbols) created in response to the pth input vector XI’. An ambitious and potentially powerful objective of unsupervised learning is to represent the environment such that the various parts of the representation are statistically imleyeiideiit of each other. In other words, we would like to have methods for decomposing the environment into entities that belong together and do not have much to do with other Nritrnl Comprtntioii 4, 863-879 (1992)
@ 1992 Massachusetts Institute of Technology
864
Jurgen Schmidhuber
entities' ("learning to divide and conquer"). This notion is captured by the concept of "discovering factorial codes" (Barlow et a / . 1989). The aim of "factorial coding" is the following: Given the statistical properties of the inputs from the environment, find iizverfible internal representations such that the occurrence of the ith code symbol y; is independent of any of the others. Such representations are called factorial because they have a remarkable and unique property: The probability of the occurrence of a particular input is simply the product of the probabilities of the corresponding code symbols. Among the advantages of factorial codes are as follows: 1. "Optimal" ii7piit se,gi~eiztation. An efficient method for discovering mutually independent features would have consequences for many segmentation tasks. For instance, consider the case where the inputs are given by retinal images of various objects with mutually independent positions. At any given time, the activations of nearby "pixels" caused by the same object are statistically correlated. Therefore a factorial code would not represent them by different parts of the internal storage. Instead, a factorial code could be created by finding input transformations corresponding to the abstract concept of "object position": Positions of different objects should be represented by different code symbols.
2. Speeding zip siipervised Ieamiizg. As Becker (1991) observes, if a representation with uncorrelated components is used as the input to a higher level linear supervised learning network, then the Hessian of the supervised network's error function is diagonal, thus allowing efficient methods for speeding up learning (note that statistical independence is a stronger criterion than the mere absence of statistical correlation). Nonlinear networks ought to profit as well.
3. Occam's razor. Any method for finding factorial codes automatically implements Occam's razor which prefers simpler models over more complex ones, where simplicity is defined as the number of storage cells necessary to represent the environment in a factorial fashion. If there are more storage cells than necessary to implement a factorial code, then the independence criterion is met by letting all superfluous units emit constant values in response to all inputs. This implies storage efficiency, as well as a potential for better generalization capabilities. 4. Novelty detection. As Barlow et al. (1989) point out, with factorial codes the detection of dependence between two symbols indicates hitherto undiscovered associations. 'The G-Max algorithm (Pearlmutter and Hinton 1986) aims at a related objective: It tries to discover features that account for input redundancy. G-Max, however, is designed for single output units only.
Learning Factorial Codes
865
Barlow et al. do not present an efficient general method for finding factorial codes (but they propose a few sequential ”nonneural” heuristic methods). Existing “neural” methods for decreasing output redundancy (e.g., Linsker 1988; Zemel and Hinton 1991; Oja 1989; Sanger 1989; Foldikk 1990; Rubner and Schulten 1990; Silva and Almeida 1991) are mostly restricted to the linear case and do not aim at the ambitious general goal of statistical independence. In addition, some of these methods require gaussian assumptions about the input and output signals, as well as the explicit calculation of the derivative of the determinant of the output covariance matrix (Shannon 1948). The main contribution of this paper is a simple but general “neural” architecture (plus the appropriate objective functions) for finding factorial codes. I would not be surprised, however, if the general problem of finding factorial codes turned out to be NP-hard. In that case, gradient-based procedures as described herein could not be expected to always find factorial codes. The paper at hand focuses on the novel basic principle without trying to provide solutions for the old problem of local maxima. Also, the purpose of this article is not to compare the performance of algorithms based on the novel principle to the performance of existing sequential ”nonneural” heuristic methods (Barlow et a!. 1989). The toyexperiments described below are merely for illustrative purposes. 2 Formulating the Problem
Let us assume iz different adaptive input processing represeiztatioiinl modides that see the same single input at a time. The output of each module can be implemented as a set of neuron-like units. Throughout this paper I focus on the simplest case: One output unit (also called a representational unit) per module. The ith module (or unit) produces an output value yr E [0.1] in response to the current external input vector XI’. In what follows, P ( A ) denotes the probability of event A , P ( A I €3) denotes the conditional probability of event A given B, y, denotes the mean of the activations of unit i, and E denotes the expectation operator. The methods described in this paper are primarily devoted to finding binary or at least quasibinary codes. Each code symbol participating in a quasibinary code is either 0 or 1 in response to a given input pattern or emits a constant value in response to every input pattern. Therefore, binary codes are a special case of quasibinary codes. Most of our quasibinary codes will be created by starting out from real-valued codes. Recall that there are three criteria that a binary factorial code must fulfill: 1 . The binary criterion: Each code-symbol should be either 1 or 0 in response to a given input pattern.
Jiirgen Schmidhuber
866
2. The invertibility criterion: It must be possible to reconstruct the input from the code. In cases where the environment is too complex (or too noisy) to be fully coded into limited internal representations (i.e., in the case of binary codes where there are more than 2dim(y) input patterns), we want to relax the invertibility criterion. In that case, we still want the internal representations to convey maximal information about the inputs. The focus of this paper, however, is on situations like the ones studied in Barlow et al. (1989): Noisefree environments and sufficient representational capacity in the representational units. In the latter case, reversibility is equivalent to Infomax as per Linsker (1988). 3. The independence criterion: The occurrence of each code symbol ought to be independent of all other code symbols. If the binary criterion is fulfilled, then we may rewrite the independence criterion by requiring that
The latter condition implies that y, does not depend on {yk. k # i}. In other words, E(y, I {yk. k # i}) is computable from a constant. Note that with real-valued codes the criterion Vi : E(y, I {yk. k # i}) = E(y,) does not necessarily imply that the yk are independent. 3 The Basic Principle and Architecture
For each representational unit i there corresponds an adaptive predictor P;,which, in general, is nonlinear. With the pth input pattern xp, Pi’s input is the concatenation of the outputs of all units k # i. Pi’s onedimensional output is trained to equal the expectation E(y, I ( 6 , k # i}). It is well known that this can be achieved by letting P; minimize2
6
(3.1)
With the help of the n predictors one can define various objective functions for the representational modules to enforce the three criteria listed above (see Sections 4 and 5). Common to these methods is that all units are trained to take on values that minimize mutual predictability via the predictors: Each unit tries to extract features from the environment such that no combination of n - 1 units conveys information about the remaining unit. In other words, no combination of n - 1 units should allow better predictions of the remaining unit than a prediction based on 2Cross-entropyis another objective function for achieving the same goal. In the experiments, however, the conventional mean squared error based function 3.1 led to satisfactory results.
Learning Factorial Codes
867
a constant. I call this the principle of intrarepresetitational predictability miniinizatioiz or, somewhat shorter, the principle of predictability minimization. A major novel aspect of this principle that makes it different from previous work is that it uses adaptive submodules (the predictors) to define the objective functions for the subjects of interest, namely, the representational units themselves. Following the principle of predictability minimization, each representational module tries to use the statistical properties of the environment to protect itself from being predictable. This forces each representational module to focus on aspects of the environment that are independent of environmental properties upon which the other modules focus. 4 Objective Functions for the Three Criteria
Sections 4.1, 4.2, and 4.3 provide objective functions for the three criteria. Sections 4.4, 4.5, and 4.6 describe various combinations of these objective functions. Section 4.7 hints at a parameter tuning problem. A way to overcome it (my preferred method for implementing predictability minimization) is presented in Section 5. 4.1 An Error Function for the Independence Criterion. For the sake of argument, let us assume that at all times each PI is as good as it can be, meaning that PI always predicts the expectation of yl conditioned on the outputs of the other modules, E(y, I {y!. k # i } ) . (In practice, the predictors will have to be retrained continually.) In the case of quasibinary codes the following objective function H is zero if the independence criterion is met:
H
=
1 -Ex [Pr 2 1
-
yl]’
(4.1)
P
This term for mutual predictability minimization aims at making the outputs independent-similar to the goal of a term for maximizing the determinant of the covariance matrix under gaussian assumptions (Linsker 1988). The latter method, however, tends to remove only linear predictability, while the former can remove nonlinear predictability as well (even without gaussian assumptions), due to possible nonlinearities learnable by nonlinear predictors. 4.2 An Objective Function for the Binary Criterion. A well-known objective function V for enforcing binary codes is given by
Maximizing this term encourages each unit to take on binary values. The contribution of each unit i is maximized if E(yi) is as close to 0.5
868
Jiirgen Schmidhuber
as possible. This implies maximal entropy for unit i under the binary constraint, i.e., i wants to become a binary unit that conveys maximal information about its input. 4.3 An Error Function for the Invertibility Criterion. The following is a simple, well-known method for enforcing invertibility: With pattern p , a reconstructor module receives the concatenation of all as an input and is trained to emit as an output the reconstruction zp of the external input xp. The basic structure is an auto encoder. The auto encoder’s objective function, to be minimized, is defined as
4.4 Combining Error Terms. A straightforward way of combining V, I, and H is to maximize the total objective function
where a, j3, and y are positive constants determining the relative weighting of the opposing error terms. Maximization of 4.3 tends to force the representational units to take on binary values that maximize independence in addition to minimizing the reconstruction error.3 4.5 Removing the Variance Term: Real-Valued Codes. If with a specific application we want to make use of the representational capacity of real-valued codes and if we are satisfied with decorrelated (instead of independent) representational units, then we might remove the V-term from 4.3 by setting a = 0. In this case, we want to minimize
Note that with real-valued units the invertibility criterion theoretically can be achieved with a single unit. In that case, the independence criterion would force all other units to take on constant values in response to all input patterns. In noisy environments, however, it may turn out to be advantageous to code the input into more than one representational unit. This has already been noted by Linsker (1988) in the context of his Infomax principle. 30ne might think of using Lagrangian multipliers (instead of arbitrary a,0, y) to rigidly enforce constraintssuch as independence. However, to use them the constraints would have to be simultaneously satisfiable. Except for special input distributions this seems to be unlikely (see also Section 4.7).
Learning Factorial Codes
869
4.6 Removing the Global Invertibility Term. Theoretically it is sufficient to do without the auto encoder and set d = 0 in 4.3. In this case, we simply want to maximize
T=flV-yH The H-term counteracts the possibility that different (near-)binary units convey the same information about the input. Setting , j = 0 means to maximize information locally for each unit while at the same time trying to force each unit to focus on different pieces of information from the environment. Unlike with autoassociators, there is I Z O global invertibility term. Note that this method seemingly works diametrically opposite to the sequential, heuristic, non-neural methods described by Barlow et nl. (1989), where the sum of bit entropies is minimized instead of being maximized. How can both methods pursue the same goal? One may put it this way: Among all invertible codes, Barlow et a / . try to find those closest to something similar to the independence criterion. In contrast, among all codes fulfilling the independence criterion (ensured by sufficiently strong ?), the above methods try to find the invertible ones. 4.7 A Disadvantage of the Above Methods. Note that a factorial code causes nonmaximal V and therefore nonmaximal T for all methods with (1 > 0 except for rare cases (such as if there are 2” equally probable different input patterns). This means that with a given problem there is some need for parameter tuning of the relative weighting factors, due to the possibility that the various constraints may not be satisfiable simultaneously (see footnote 3). The method in the next section avoids this necessity for parameter tuning by replacing the term for variance maximization by a predictor-based term for conditioned variance maximization.
5 Local Conditioned Variance Maximization This is the author’s preferred method for implementing the principle of predictability minimization. It does not suffer from the parameter tuning problems involved with the V-term above. It is extremely straightforward and reveals a striking symmetry between opposing forces. Let us define
Recall that Py is supposed to be equal to E(y, I {yi.k # i } ) ,and note that 5.1 is formally equivalent to the sum of the objective functions E , of the predictors (equation 3.1).
Jurgen Schmidhuber
870
As in Section 4.6 we drop the global invertibility term and redefine the total objective function T to be maximized by the representational modules as
T
=
v c - yH
(5.2)
Conjecture. I conjecture that if there exists a quasibinary factorial code for a given pattern ensemble, then among all possible (real-valued or binary) codes T is maximized with a quasi-binary factorial code, even if 7 =o. If this conjecture is true, then we may forget about the H-term in 5.2 and simply write T = VC. In this case, all representational units simply try to maximize the same function that the predictors try to minimize, namely, Vc. In other words, this generates a symmetry between two forces that fight each other-one trying to predict and the other one trying to escape the predictions. The conjecture remains unproven for the general case. The long version of this paper, however, mathematically justifies the conjecture for certain special cases and provides some intuitive justification for the general case (Schmidhuber 1991). In addition, algorithms based solely on Vc-maximization performed well in the experiments to be described below. 6 "Neural" Implementation
In a realistic application, of course, it is implausible to assume that the errors of all P; are minimal at all times. After having modified the functions computing the internal representations, the Pi must be trained for some time to assure that they can adapt to the new situation. Each of the n predictors, the n representational modules, and the potentially available autoassociator can be implemented as a feedforward backpropagation network (e.g., Werbos 1974). There are two alternating passes-one for minimizing prediction errors and the other one for maximizing T . Here is an off-line version based on successive "epochs" (presentations of the whole ensemble of training patterns):
PASS 1 (minimizing prediction errors): Repeat for a "sufficient" number of training epochs: 1. For all p: 1.1. For all i: Compute fl. 1.2. For all i: Compute Pr. 2. Change each weight w of each P, according to
where rp is a positive constant learning rate.
Learning Factorial Codes
871
PASS 2 (minimizing predictability): 2. For all p: 2.1. For all i: Compute yr. 2.2. For all i: Compute Py. 2.3. If an autoassociator is involved, compute zp.
2 . Change each weight u of each representational moduleaccording to
where 7/R is a positive constant learning rate. The weights of the P, do not change during this pass, but all other weights do change. Note that PASS 2 requires backpropagation of error signals through the predictors (uiithout changing their zoeights) and then through their n 1 input units (which are the output units of the representatiorial modules) down to the weights of the representational modules. ~
The off-line version above is perhaps not as appealing as a more local procedure where computing time is distributed evenly between PASS 2 and PASS 1: An on-line version. An extreme on-line version does not sweep through the whole training ensemble before changing weights. Instead it processes the same single input pattern xp (randomly chosen according to the input distribution) in both PASS 1 and PASS 2 and immediately changes the weights of all involved networks simultaneously, according to the contribution of xp to the respective objective functions. Simultaneous updating of the representations and the predictors, however, introduces a potential for instabilities. Both the predictors and the representational modules perform gradient descent (or gradient ascent) in changing functions. Given a particular implementation of the basic principle, experiments are needed to find out how much on-line interaction is permittable. With the toy-experiments reported below, on-line learning did not cause major problems. It should be noted that if T = Vc = C,Ep, (Section 5 ) , then with a given input pattern we may compute the gradient of Vc with respect to both the predictor weights and the weights of the representation modules in a single pass. After this we may simply perform gradient descent in the predictor weights and gradient ascent in the remaining weights (it is just a matter of flipping signs). This was actually done in the experiments. Local maxima. Like all gradient ascent procedures, the method is subject to the problem of local maxima. A standard method for dealing with local maxima is to repeat the above algorithm with different weight initializations (using a fixed number nF of training epochs for each repetition) until a (near-)factorial code is found. Each repetition corresponds
to a local search around the point in weight space defined by the current weight initialization.

Shared hidden units. It should be mentioned that some or all of the representational modules may share hidden units. The same holds for the predictors. Predictors sharing hidden units, however, will have to be updated sequentially: no representational unit may be used to predict its own activity.
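To make the procedure concrete, here is a minimal numpy sketch of the extreme on-line, single-pass variant ("flipping signs"). It assumes logistic predictors without hidden units and a single logistic representational layer; pattern sizes and learning rates are illustrative choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 one-hot ("local") input patterns, dim(y) = 2 representational units.
X = np.eye(4)
dim_x, dim_y = 4, 2
eta_P, eta_R = 0.5, 0.05                   # predictor / representation rates (assumed)

R = rng.normal(0.0, 0.5, (dim_y, dim_x))   # representational weights
bR = np.zeros(dim_y)
W = rng.normal(0.0, 0.5, (dim_y, dim_y - 1))  # one logistic predictor P_i per unit
bW = np.zeros(dim_y)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for step in range(20000):
    x = X[rng.integers(len(X))]            # on-line: a single random pattern x^p
    y = sigmoid(R @ x + bR)                # code y^p
    others = np.array([np.delete(y, i) for i in range(dim_y)])  # inputs of each P_i
    z = np.einsum('ij,ij->i', W, others) + bW
    P = sigmoid(z)                         # predictions P_i^p
    e = P - y                              # prediction errors
    # Single backward pass for V_C = sum_i (P_i^p - y_i^p)^2.
    dz = 2.0 * e * P * (1.0 - P)
    dW, dbW = dz[:, None] * others, dz     # gradients w.r.t. predictor weights
    dy = -2.0 * e                          # direct path: y_i is the target of P_i
    for i in range(dim_y):                 # indirect path: y_k (k != i) feeds P_i
        dy[np.arange(dim_y) != i] += dz[i] * W[i]
    dbR = dy * y * (1.0 - y)
    dR = np.outer(dbR, x)
    W -= eta_P * dW; bW -= eta_P * dbW     # predictors: gradient descent
    R += eta_R * dR; bR += eta_R * dbR     # representation: gradient ascent

print(np.round(sigmoid(R @ X.T + bR[:, None]).T, 2))  # the 4 codes, ideally 4 corners
```

With a uniform ensemble, a run of this kind should drive the codes toward distinct binary corners while the predictors drift toward the constant prediction 0.5; since the sketch is untuned, convergence is not guaranteed on every seed.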
7 Experiments
All the illustrative experiments described below are based on T as defined in Section 5, with γ = 0. In other words, the representational units try to maximize the same objective function V_C that the predictors try to minimize. All representational modules and predictors were implemented as 3-layer backpropagation networks. All hidden and output units used logistic activation functions and were connected to a bias unit with constant activation. Parameters such as learning rates and numbers of hidden units were not chosen to optimize performance; there was no systematic attempt to improve learning speed. Daniel Prelinger and Jeff Rink implemented on-line and off-line systems based on Section 6 [see details in Schmidhuber (1991) and Prelinger (1992)]. The purpose of this section, however, is not to compare on-line and off-line versions but to demonstrate that both can lead to satisfactory results. With the off-line version, the sufficient number of consecutive epochs in PASS 1 was taken to be 5. With the on-line system, at any given time, the same single input pattern was used in both PASS 1 and PASS 2. The learning rates of all predictors were 10 times higher than the learning rates of the representational modules. An additional modification for escaping certain cases of local minima was introduced (see Schmidhuber 1991; Prelinger 1992).

The significance of nonlinearities. In many experiments it turned out that the inclusion of hidden units led to better performance. Assume that dim(y) = 3 and that there is an XOR-like relationship between the activations of the first two representational units and the third one. A linear predictor could not possibly detect this relationship; therefore the representational modules could not be encouraged to remove the redundancy.

The next subsections list some selected experiments with both the on-line and the off-line method. In what follows, the term "local input representation" means that there are dim(x) different binary inputs, each with only one nonzero bit. The term "distributed input representation" means that there are 2^dim(x) different binary inputs. In all experiments, a representational unit was considered to be binary if the absolute difference
between its possible activations and either the maximal or the minimal activation permitted by its activation function never exceeded 0.05.

Local maxima. In some of the experiments, multiples of 10,000 training epochs were employed. In many cases, however, the representational units settled into a stable code long before the training phase was over (even if the code corresponded to a suboptimal solution). The repetitive method based on varying weight initializations (Section 6) sometimes allowed shorter overall learning times (using values of n_E of the order of a few thousand). A high number of repetitions increases the probability that a factorial code is found. Again it should be emphasized, however, that learning speed and methods for dealing with local maxima are not the main objectives of this paper.
7.1 Uniformly Distributed Inputs. In the experiments described in this subsection there are 2^dim(y) different, uniformly distributed input patterns. This means that the desired factorial codes are the full binary codes. In the case of a factorial code, all predictors emit 0.5 in response to every input pattern (this makes all conditional expectations equal to the unconditional expectations).

Experiment 1. Off-line, dim(y) = 2, dim(x) = 4, local input representation, 3 hidden units per predictor, 4 hidden units shared among the representational modules. Ten test runs with 20,000 epochs for the representational modules were conducted. In 8 cases this was sufficient to find a binary (factorial) code.

Experiment 2. On-line, dim(y) = 2, dim(x) = 2, distributed input representation, 2 hidden units per predictor, 4 hidden units shared among the representational modules. Ten test runs were conducted. Fewer than 3000 pattern presentations (equivalent to about 700 epochs) were always sufficient to find a binary factorial code.

Experiment 3. Off-line, dim(y) = 4, dim(x) = 16, local input representation (16 patterns), 3 hidden units per predictor, 16 hidden units shared among the representational modules. Ten test runs with 20,000 epochs for the representational modules were conducted. In 1 case the system found an invertible factorial code. In 4 cases it created a near-factorial code with only 15 different output patterns in response to the 16 input patterns. In 3 cases it created only 13 different output patterns; in 2 cases, only 12.

Experiment 4. On-line, dim(y) = 4, dim(x) = 4, distributed input representation (16 patterns), 6 hidden units per predictor, 8 hidden units shared among the representational modules. Ten test runs were conducted. In all cases but one, the system found a factorial code within fewer than 4000 pattern presentations (corresponding to fewer than 300 epochs).
7.2 Occam's Razor at Work. The experiments in this section are meant to verify the effectiveness of Occam's razor, mentioned in the introduction. It is interesting to note that with nonfactorial codes, predictability minimization prefers to reduce the number of used units instead of minimizing the sum of bit entropies as in Barlow et al. (1989). This can be seen by looking at an example described by Mitchison in the appendix of the paper by Barlow et al., which exhibits a case where the minimal sum of bit entropies is achieved by an expansive local coding of the input. Local representations, however, maximize mutual predictability: with local representations, each unit can always be predicted from all the others. Predictability minimization tries to avoid this by creating nonlocal, nonexpansive codings.
Experiment 1. Off-line, dim(y) = 3, dim(x) = 4, local input representation, 3 hidden units per predictor, 4 hidden units shared among the representational modules. Ten test runs with 10,000 epochs for the representational modules were conducted. In 7 cases the system found a binary factorial code: in the end, one of the output units always emitted a constant value. In the remaining 3 cases, the code was at least binary and invertible.

Experiment 2. Off-line, dim(y) = 4, dim(x) = 4, local input representation, 3 hidden units per predictor, 4 hidden units shared among the representational modules. Ten test runs with 10,000 epochs for the representational modules were conducted. In 5 cases the system found a binary factorial code: in the end, two of the output units always emitted a constant value. In the remaining cases, the code did not use the minimal number of output units but was at least binary and invertible.

Experiment 3. On-line, dim(y) = 4, dim(x) = 2, distributed input representation, 2 hidden units per predictor, 4 hidden units shared among the representational modules. Ten test runs with 250,000 pattern presentations were conducted. This was always sufficient to find a quasi-binary factorial code: in the end, two of the output units always emitted a constant value. In 7 out of 10 cases, fewer than 100,000 pattern presentations (corresponding to 25,000 epochs) were necessary.

7.3 Nonuniformly Distributed Inputs. The input ensemble considered in this subsection consists of four different patterns denoted by x^a, x^b, x^c, and x^d, respectively. The probabilities of the patterns were
P(x^a) = 1/9,  P(x^b) = 2/9,  P(x^c) = 2/9,  P(x^d) = 4/9

This ensemble allows for binary factorial codes, one of which is denoted by the following code F:

y^a = (1, 1)^T,  y^b = (0, 1)^T,  y^c = (1, 0)^T,  y^d = (0, 0)^T
With code F, the total objective function V_C becomes V_C^F = 2. A nonfactorial but invertible (information-preserving) code B is given by
With code B, V_C = 19/10, which is only 1/10 below V_C^F. This already indicates that certain local maxima of the internal state's objective function may be very close to the global maxima.

Experiment 1. Off-line, dim(y) = 2, dim(x) = 2, distributed input representation with x^a = (0, 0)^T, x^b = (0, 1)^T, x^c = (1, 0)^T, x^d = (1, 1)^T, 1 hidden unit per predictor, 2 hidden units shared among the representational modules. Ten test runs with 2000 epochs for the representational modules were conducted. Here one epoch consisted of the presentation of 9 patterns: x^a was presented once, x^b twice, x^c twice, and x^d four times. In 7 cases, the system found a global maximum corresponding to a factorial code. In the remaining cases the code was not invertible.

Experiment 2 (Occam's Razor). As Experiment 1, but with dim(y) = 3. In all but one of the 10 test runs the system developed a factorial code (including one unused unit). In the remaining test run the code was at least invertible. With local input representation and dim(x) = 4, dim(y) = 2, the success rate dropped below 50%. With dim(y) = 3, the system usually found invertible but rarely factorial codes. This reflects the fact that with certain input ensembles there is a trade-off between redundancy and invertibility: superfluous degrees of freedom among the representational units may increase the probability that an information-preserving code is found, while at the same time decreasing the probability that an optimal factorial code is found.
8 Predictability Minimization and Time

Let us now consider the case of input sequences. This section describes an entirely local method designed to find unambiguous, nonredundant, reduced sequence descriptions. The initial state vector y^p(0) is the same for all sequences p. The input at time t > 0 of sequence p is the concatenation x^p(t) ∘ y^p(t - 1) of the current input x^p(t) and the last internal state y^p(t - 1). The output is y^p(t) itself. We minimize and maximize essentially the same objective functions as described above. That is, for the ith module, which now needs recurrent connections to itself and the other modules, there is again an adaptive predictor P_i, which need not be recurrent. P_i's input at time t is the concatenation of
the outputs y_k^p(t) of all units k ≠ i. P_i's one-dimensional output P_i^p(t) is trained to equal the expectation of the output y_i, given the outputs of the other units, E(y_i | {y_k(t), k ≠ i}), by defining P_i's error function at time t as (y_i^p(t) - P_i^p(t))².
In addition, all units are trained to take on values that maximize

E = Σ_t T(t)
where T(t) is defined analogously to the respective stationary cases. The only way a unit can protect itself from being predictable from the other units is to store properties of the input sequences that are independent of aspects stored by the other units. In other words, this method will tend to throw away redundant temporal information, much as the systems in Schmidhuber (1992a,b). For computing weight changes, each module looks back only to the last time step. In the on-line case, this implies an entirely local learning algorithm. Still, even when there are long time lags, the algorithm theoretically may learn unique representations of extended sequences, as can be seen by induction over the length of the longest training sequence:

1. y can learn unique representations of the beginnings of all sequences.
2. Suppose all sequences and subsequences of length < k are uniquely represented in y. Then, by looking back only one time step at a time, y can learn unique representations of all subsequences of length k.

The argument neglects all on-line effects and possible cross-talk. On-line variants of the system described above were implemented by Daniel Prelinger, and preliminary experiments were conducted with the resulting recurrent systems. These experiments demonstrated that there are entirely local sequence learning methods that allow for learning unique representations of all subsequences of nontrivial sequences (such as a sequence consisting of 8 consecutive presentations of the same input pattern, represented by the activation of a single input unit). Best results were obtained by introducing additional modifications (such as error functions other than mean squared error for the representational modules). A future paper will elaborate on sequence learning by predictability minimization. A structural sketch of the recurrent input construction follows.
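The following numpy fragment sketches only the recurrent input construction; the sizes are our illustrative assumptions, and the training step (the same descent/ascent update as in the stationary sketch of Section 6, applied once per time step with y(t - 1) held constant) is elided:

```python
import numpy as np

rng = np.random.default_rng(1)
dim_x, dim_y = 3, 2
R = rng.normal(0.0, 0.5, (dim_y, dim_x + dim_y))  # each module sees x(t) and y(t-1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def code_sequence(xs):
    y = np.zeros(dim_y)              # y(0): the same initial state for all sequences
    codes = []
    for x in xs:
        u = np.concatenate([x, y])   # input at time t: x^p(t) concatenated with y^p(t-1)
        y = sigmoid(R @ u)           # new internal state y^p(t)
        codes.append(y)
        # A training step would go here, looking back only one time step:
        # gradients flow into R through u and the current y only.
    return np.array(codes)

seq = [np.eye(dim_x)[0]] * 8         # 8 consecutive presentations of one pattern
print(np.round(code_sequence(seq), 2))  # untrained states; training should separate them
```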
9 Concluding Remarks, Outlook

Although gradient methods based on predictability minimization cannot always be expected to find factorial codes, due to local minima and the
possibility that the problem of finding factorial codes may be NP-hard, they have the potential for removing kinds of redundancy that previous linear methods were not able to remove. This holds even if the conjecture in Section 5 ultimately proves to be false. In many realistic cases, however, approximations of nonredundant codes should be satisfactory. It remains to be seen whether predictability minimization can be useful for finding nearly nonredundant representations of real-world inputs. In ongoing research it is intended to apply the methods described herein to problems of unsupervised image segmentation (in the case of multiple objects), as well as to unsupervised sequence segmentation.

There is a relationship between predictability minimization and more conventional "competitive" learning schemes: in a certain sense, units compete for representing certain "abstract" transformations of the environmental input. The competition is not based on a physical "neighborhood" criterion but on mutual predictability. Unlike most previous schemes based on "winner-take-all" networks, output representations formed by predictability minimization may have multiple "winners," as long as they stand for independent features extracted from the environment. One might speculate about whether the brain uses a similar principle based on "representational neurons" trying to escape the predictions of "predictor neurons." Since the principle allows for entirely local sequence learning algorithms (in space and time), it might be biologically more plausible than methods such as "backpropagation through time."

Predictability minimization also might be useful in cases where different representational modules see different inputs. For instance, if a binary feature of one input "patch" is predictable from features extracted from neighboring "patches," then representations formed by predictability minimization would tend not to use additional storage cells for representing that feature. The paper at hand adopts a general viewpoint on predictability minimization by focusing on the general case of nonlinear nets. In some cases, however, it might be desirable to restrict the computational power of the representational modules and/or the predictors by making them linear or semilinear. For instance, a hierarchical system with successive stages of computationally limited modules may be useful for reflecting the hierarchical structure of certain environments. Among the additional topics covered by the longer version of this report (Schmidhuber 1991) are general remarks on unsupervised learning and information-theoretic aspects, a "neural" approach to finding binary factorial codes without using predictors, implementations of predictability minimization using binary stochastic units, the relationship of predictability minimization to recent sequence chunking methods, and combinations of goal-directed learning and unsupervised predictability minimization.
Acknowledgments

Thanks to Daniel Prelinger and Jeff Rink for implementing and testing the algorithms. Thanks to Mike Mozer, Daniel Prelinger, Radford Neal, Luis Almeida, Peter Dayan, Sue Becker, Rich Zemel, and Clayton McMillan for valuable comments and suggestions that helped to improve the paper. This research was supported by NSF PYI award IRI-9058450, Grant 9021 from the James S. McDonnell Foundation, and DEC external research Grant 1250 to Michael C. Mozer.
References

Barlow, H. B., Kaushal, T. P., and Mitchison, G. J. 1989. Finding minimum entropy codes. Neural Comp. 1, 412-423.
Becker, S. 1991. Unsupervised learning procedures for neural networks. Int. J. Neural Syst. 2(1 & 2), 17-33.
Földiák, P. 1990. Forming sparse representations by local anti-Hebbian learning. Biol. Cybernet. 64, 165-170.
Linsker, R. 1988. Self-organization in a perceptual network. IEEE Comput. 21, 105-117.
Oja, E. 1989. Neural networks, principal components, and subspaces. Int. J. Neural Syst. 1(1), 61-68.
Pearlmutter, B. A., and Hinton, G. E. 1986. G-maximization: An unsupervised learning procedure for discovering regularities. In Neural Networks for Computing: American Institute of Physics Conference Proceedings 252, J. S. Denker, ed., Volume 2, pp. 333-338. Morgan Kaufmann.
Prelinger, D. 1992. Diploma thesis, Institut für Informatik, Technische Universität München, in preparation.
Rubner, J., and Schulten, K. 1990. Development of feature detectors by self-organization: A network model. Biol. Cybernet. 62, 193-199.
Sanger, T. D. 1989. An optimality principle for unsupervised learning. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 11-19. Morgan Kaufmann, San Mateo, CA.
Schmidhuber, J. H. 1991. Learning factorial codes by predictability minimization. Tech. Rep. CU-CS-565-91, Dept. of Comp. Sci., University of Colorado at Boulder, December.
Schmidhuber, J. H. 1992a. Learning complex, extended sequences using the principle of history compression. Neural Comp. 4(2), 234-242.
Schmidhuber, J. H. 1992b. Learning unambiguous reduced sequence descriptions. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., pp. 291-298. Morgan Kaufmann, San Mateo, CA.
Shannon, C. E. 1948. A mathematical theory of communication (part III). Bell System Tech. J. XXVII, 623-656.
Silva, F. M., and Almeida, L. B. 1991. A distributed decorrelation algorithm. In Neural Networks: Advances and Applications, E. Gelenbe, ed. North-Holland, Amsterdam.
Werbos, P. J. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University.
Zemel, R. S., and Hinton, G. E. 1991. Discovering viewpoint-invariant relationships that characterize objects. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 299-305. Morgan Kaufmann, San Mateo, CA.
Received 2 January 1992; accepted 14 April 1992.
Communicated by Halbert White
How to Incorporate New Pattern Pairs without Having to Teach the Previously Acquired Pattern Pairs

H. M. Wabgaonkar
A. R. Stubberud
Electrical and Computer Engineering Department, University of California, Irvine, California 92717 USA
In this paper, we deal with the problem of associative memory synthesis. The particular issue that we wish to explore is the ability to store new input-output pattern pairs without having to modify the path weights corresponding to the already taught pattern pairs. The approach to the solution of this problem is via interpolation carried out in a Reproducing Kernel Hilbert Space setting. An orthogonalization procedure carried out on a properly chosen set of functions leads to the solution of the problem.

1 Introduction
Associative memories can be implemented conveniently by employing neural networks as function approximators (Cybenko 1989; Stinchcombe and White 1989) or function interpolators (Wabgaonkar and Stubberud 1990). The problem of discrete associative memory synthesis is as follows. Given a finite discrete set of input-output pairs {(x^i, y^i)}, where the input patterns x^i are in E^n (the n-dimensional Euclidean space) and the corresponding outputs y^i are in R^1, the problem is to find an interpolating function f such that f(x^i) = y^i for all i. It can be shown that a three-layer neural network with an input layer, a single hidden layer, and an output layer of processing units with linear characteristics can be employed to solve the above problem (Cybenko 1989; Davis 1975). The required function as realized by the network can be written as
f(x) = Σ_i a_i g_i(x),  x ∈ E^n  (1.1)
where g_i : E^n → R^1 represents the characteristic, or activation function, of the ith processing unit in the hidden layer. If the number of pattern pairs is N, then N functions g_i must be chosen such that the matrix G = [g_j(x^i)] is of maximal rank. If this condition is satisfied, the vector a of the coefficients {a_j} can be obtained uniquely. In this paper, we are concerned with the problem of storing successive data pairs without having to alter the weights corresponding to the

Neural Computation 4, 880-887 (1992) © 1992 Massachusetts Institute of Technology
previously stored data pairs for the net modeled by 1.1. Specifically, we require that the previously acquired coefficients a_i, i = 1, ..., K, should not have to be modified in order to store "future pairs" (x^{K+1}, y^{K+1}), (x^{K+2}, y^{K+2}), etc. Imposing the interpolation conditions f(x^i) = y^i, i = 1, ..., K, we see from 1.1 that our objective will be achieved if the matrix G is lower triangular, i.e.,
g_j(x^i) = 0  for j > i  (1.2)
A convenient framework for solving the discrete associative memory synthesis problem by interpolation is the Reproducing Kernel Hilbert Space (RKHS) framework. Intuitively, our approach to the solution of this particular data storage problem is as follows. We define a kernel K(x, y), x, y in E^n (called the reproducing kernel, for reasons to be mentioned later), such that g_j(x) = K(x, x^j) and K(x^i, x^j) = (K(x, x^i), K(x, x^j))_x, where (·, ·) indicates an appropriate inner product and the subscript x indicates that the inner product is taken with respect to the variable x. This allows us to convert the "spatial" orthogonality condition of 1.2 into an orthogonality condition on the functions K(·, x^j), j = 1, ..., N. Any standard orthogonalization technique, such as the Gram-Schmidt procedure, can then be applied to achieve the desired orthogonality of the successive neuronal activation functions g_j. In Section 2, we begin by briefly reviewing some of the relevant properties of RKHS. The approaches to the general interpolation problem and to the particular pattern storage problem are presented. A specific example of an RKHS, presented in Section 3, is the Generalized Fock Space (GFS) framework originally proposed by de Figueiredo and co-workers (1980, 1990). Finally, we present an example illustrating the proposed technique.

2 RKHS: A Brief Overview
In this section, we present a very brief overview of the RKHS theory relevant to our work. The proofs of the general results related to RKHS can be found in Aronszajn (1950), Davis (1975), and de Figueiredo and Dwyer (1980); they will not be presented here. Next, we indicate how the required associative memory synthesis can be carried out via interpolation within an RKHS framework. This sets the stage for the particular orthogonality problem, which is also solved in this section.

2.1 Preliminaries. Let D be a compact subset (containing the points x^i) of E^n; let H be a Hilbert space of continuous real-valued functions defined on D. Let (·, ·) indicate a real-valued, symmetric inner product with respect to which H is complete. Since D is compact and the functions g of H are continuous, the functions g are also bounded. Therefore, each
of the linear point evaluation functionals L given by

L(g) = g(x),  x ∈ D,  ∀g ∈ H  (2.1)
is also bounded. Hence, by the Fréchet-Riesz theorem, for each bounded linear functional L on H there exists an element (a function) l of H, called the representer of the functional L, such that

L(g) = (g, l),  ∀g ∈ H  (2.2)
Since each of the point evaluation functionals on H is bounded, there exists a function of two variables K(x, y), with x, y in D, such that (Aronszajn 1950):

1. for each fixed y in D, K(x, y) considered as a function of x is in H;
2. for every function g in H and for every y in D, the following reproducing property holds:

g(y) = (g(x), K(x, y))_x  (2.3)
Due to property (2) above, K(·, ·) is called the reproducing kernel (r.k.) of the RKHS H. If a space has an r.k., it is unique. Also, in our case, K(·, ·) happens to be a real-valued, symmetric function, i.e.,

K(x, y) = K(y, x)  (2.4)
We note in passing that, although not usually recognized in this way, the familiar sinc function, given (e.g., for signals band-limited to [-π, π]) by

K(x, y) = sin(π(x - y)) / (π(x - y))  (2.5)
is an r.k. for the space S of finite-energy, Fourier-transformable, band-limited signals (Yao 1967). This kernel is used for reconstructing a signal element of S from its samples.

2.2 Interpolation. The significant advantage of using an RKHS framework is the ability to extract the representers l of the functionals L: the representer l of the functional L is given by
l(y) = L_x[K(x, y)]  (2.6)
where the subscript x indicates that L operates on K as if K were a function of the variable x alone. In particular, for the point evaluation functionals L_i given by

L_i(f) = f(x^i),  x^i in D  (2.7)
the corresponding representers l_i are given by

l_i(x) = K(x, x^i)  (2.8)
We intend to use these representer functions l_i for interpolation. Specifically, we assume that we are given N input-output data pairs (x^i, y^i) such that the x^i are in D and are distinct. The RKHS H with the r.k. K is formed out of real-valued continuous functions on D. We are now in a position to make the following claim:
Claim: Under the assumptions on D and H, the matrix G = [l_j(x^i)] is of maximal rank, where the l_j are the representers given by 2.8.

For the proof of the above claim, refer to Wabgaonkar and Stubberud (1990) and de Figueiredo (1990). At this stage, it follows from 2.8 that the required function can be written as (see 1.1)
f(x) = Σ_{i=1}^{N} a_i K(x, x^i),  x ∈ D  (2.9)
In view of the above claim, the coefficient vector a can be determined uniquely so as to satisfy the interpolation constraints stated earlier:

f(x^i) = y^i,  i = 1, ..., N  (2.10)
For a properly chosen RKHS H, it is possible to choose (as we indicate in the next section)

K(x, x^i) = h(x^T x^i)  (2.11)
for a suitably defined function h : R^1 → R^1. With this choice, equation 2.9 is in complete conformity with equation 1.1 and with the three-layer neural net structure described in the previous section.

2.3 Orthogonality. Using equation 2.9, the orthogonality condition of 1.2 becomes
K(x^i, x^j) = 0  for j > i  (2.12)
But, due to the property of the r.k., we have

K(x^i, x^j) = (K(x, x^i), K(x, x^j))_x  (2.13)
Therefore, we must have K(·, x^j) orthogonal to K(·, x^i) for j > i. This immediately suggests the use of the Gram-Schmidt procedure for orthogonalization. Recall that l_i = K(·, x^i). Then the following equations (Kreyszig 1978) describe the required procedure:

g_1 = l_1 / ||l_1||  (2.14)

h_i = l_i - Σ_{j=1}^{i-1} (l_i, g_j) g_j  (2.15)

g_i = h_i / ||h_i||  (2.16)
The functions {g_i} satisfy the required orthogonality property. Of course, the additional price to be paid for this property is the extra computation needed for the orthogonalization. From equations 2.12 and 2.15, it follows that

||h_i||² = h_i(x^i)  (2.17)
This considerably simplifies the norm computations, which are perhaps the most demanding computations in the algorithm. Finally, we note that the above procedure is data dependent. The possibility of employing symbolic computations for implementing the above procedure is currently being explored. A small sketch of the incremental procedure follows.
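As an illustration only, here is a compact numpy sketch of equations 2.8 and 2.14-2.17; the kernel choice ρ_0 = 1 anticipates Section 3, and the class and variable names are ours, not the paper's:

```python
import numpy as np

def k(x, xp):
    return np.exp(np.dot(x, xp))   # the GFS r.k. of Section 3 with rho_0 = 1 (assumed)

class IncrementalMemory:
    """Stores (x^i, y_i) pairs one at a time. Because the g_j are orthonormal
    and G = [g_j(x^i)] is lower triangular, storing a new pair never changes
    the previously computed coefficients a_1, ..., a_{i-1}."""
    def __init__(self):
        self.xs, self.G, self.a = [], [], []   # centers, g_j coefficient vectors, a_j

    def _eval(self, coeffs, x):                # evaluates sum_m coeffs[m] K(x, x^m)
        return sum(c * k(x, xm) for c, xm in zip(coeffs, self.xs))

    def store(self, x, y):
        self.xs.append(np.asarray(x, float))
        h = np.zeros(len(self.xs)); h[-1] = 1.0      # start from l_i = K(., x^i)
        for gj in self.G:                            # h_i = l_i - sum_j g_j(x^i) g_j
            h[:len(gj)] -= self._eval(gj, x) * np.asarray(gj)
        norm = np.sqrt(self._eval(h, x))             # ||h_i||^2 = h_i(x^i), eq. 2.17
        residual = y - sum(aj * self._eval(gj, x) for aj, gj in zip(self.a, self.G))
        self.G.append(h / norm)                      # g_i = h_i / ||h_i||
        self.a.append(residual / norm)               # since g_i(x^i) = ||h_i||

    def __call__(self, x):
        return sum(aj * self._eval(gj, x) for aj, gj in zip(self.a, self.G))

mem = IncrementalMemory()
mem.store((0.0, 0.5), 1.0)
mem.store((0.5, 0.0), -1.0)
saved = list(mem.a)
mem.store((0.5, 0.5), 2.0)                 # a new pair: earlier coefficients untouched
assert mem.a[:2] == saved
print(round(mem(np.array([0.5, 0.5])), 6)) # -> 2.0 (the interpolation condition holds)
```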
3 An RKHS Example

In this section, we consider a specific example of an RKHS, called the Generalized Fock Space (GFS), due to de Figueiredo and co-workers (1980). Its application to the synthesis of the Hamming net was proposed recently (de Figueiredo 1990). Here, we present it in the somewhat different context of the discrete associative memory synthesis problem. Finally, we give a numerical example.

3.1 The GFS. Suppose the domain D (to which the inputs x^i belong) is a compact subset of E^n, such that
D = {x ∈ E^n : ||x||² < M²}  (3.1)

For future use, let ρ = {ρ_0, ρ_1, ...} be a sequence of positive real numbers subject to condition 3.2.
The function-elements g of the GFS H are real, analytic functions defined on D whose coefficients satisfy the norm condition 3.3. The multi-index coefficients c_k are defined through the following two equations:

g(x) = Σ_{i=0}^{∞} (1/i!) g_i(x),  x ∈ D  (3.4)

in which the functions g_i are the degree-i terms, given in terms of the c_k by 3.5.
In the above three equations we have used the following multi-index notation:

k = (k_1, k_2, ..., k_n)  (3.6)
|k| = k_1 + k_2 + ··· + k_n  (3.7)
k! = k_1! k_2! ··· k_n!  (3.8)
c_k = c_{k_1, k_2, ..., k_n}  (3.9)
x^k = (x_1)^{k_1} (x_2)^{k_2} ··· (x_n)^{k_n}  (3.10)
The inner product between the elements f and g of H is given by equation 3.11, where the c_k are the coefficients corresponding to g and the d_k are defined similarly for f (see 3.3-3.5). The r.k. K(x, y) of H is given by

K(x, y) = φ(x^T y)  (3.12)

where φ is the power series

φ(t) = Σ_{n=0}^{∞} t^n / (ρ_n n!)  (3.13)

Setting ρ_n = ρ_0 for all n, the r.k. becomes

K(x, y) = (1/ρ_0) exp(x^T y)  (3.14)

It is straightforward to show that the reproducing property holds:

g(y) = (g(x), K(x, y))_x  (3.15)
We will now work out an illustrative example in this setup.

3.2 Numerical Example. Let us consider the problem of storing the 2-input EXOR function. Specifically, we wish to store the following four data pairs:
x^1 = (1, 1),  y_1 = 1;  x^2 = (-1, -1),  y_2 = 1;  x^3 = (1, -1),  y_3 = -1;  x^4 = (-1, 1),  y_4 = -1
Within the GFS, the required interpolating function is given by

f(x) = Σ_{i=1}^{4} a_i exp(x^T x^i)  (3.16)

in which we have chosen ρ_0 = 1. By direct inversion of the corresponding G matrix, the coefficient vector a turns out to be a = 0.1810 (1, 1, -1, -1).
For the Gram-Schmidt procedure, we begin with the pair (x^1, y_1). The corresponding function g_1 is given by

g_1(x) = exp(-1 + x_1 + x_2)  (3.17)
and the corresponding coefficient a_1 is exp(-1). Next, we assume that we are given the second pair (x^2, y_2). With the above values of a_1 and g_1, the next function g_2 and the coefficient a_2 are calculated using 2.14-2.16. The function g_2 is given by
g_2(x) = [exp(-x_1 - x_2) - exp(-4 + x_1 + x_2)] / [exp(2) - exp(-6)]^{1/2}  (3.18)
with a_2 equal to exp(-1) [(1 - exp(-4)) / (1 + exp(-4))]^{1/2}. Given the third data pair (x^3, y_3), the function g_3 turns out to be
g_3(x) = [exp(x_1 - x_2) - exp(-2 + x_1 + x_2) - b_1 {exp(-x_1 - x_2) - exp(-4 + x_1 + x_2)}] / b_2^{1/2}  (3.19)
in which the constant b_1 is [1 - exp(-4)] / [exp(2) - exp(-6)], and the constant b_2 is [exp(6) - exp(2)] / [exp(4) + 1]. The corresponding coefficient a_3 is

a_3 = -[(1 + 2 exp(-2) + exp(-4)) / (1 + exp(-4))] {[1 + exp(4)] / [exp(6) - exp(2)]}^{1/2}  (3.20)
Finally, given (x^4, y_4), the function g_4 can be shown to be

g_4(x) = [exp(x_1 - x_2) - exp(2 + x_1 + x_2) - exp(2 - x_1 - x_2) + exp(4 - x_1 + x_2)] / [exp(5) - exp(1)]  (3.21)
and the coefficient a_4 is obtained analogously (equation 3.22).
It can be seen that a_1 = a_2 = -a_3 = -a_4 = exp(-1) is a reasonable approximation. Similarly, in approximating the coefficients in the equations for the functions g_i, we can set exp(-t) to zero for t > 2. A numerical check of the direct-inversion solution follows.
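A quick numeric check of the direct-inversion result (a sketch; with ρ_0 = 1 the G matrix has entries exp(x^i · x^j)):

```python
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)
G = np.exp(X @ X.T)                        # G[i, j] = K(x^i, x^j) = exp(x^i . x^j)
a = np.linalg.solve(G, y)
print(np.round(a, 4))                      # -> [ 0.181  0.181 -0.181 -0.181]
print(1.0 / (np.exp(2) + np.exp(-2) - 2))  # the common magnitude, 0.1810...
```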
4 Concluding Remarks

In this paper, we have presented a constructive approach to the solution of the discrete associative memory synthesis problem. The synthesis is carried out in an RKHS framework. A very desirable feature of this type of setting is the ability to store the given successive pattern pairs
without having to retrain the previously acquired path weights. This attribute of the proposed approach is also demonstrated. Given a multivariate function defined on a nondiscrete domain, the choice of the projection directions and their number are not obvious. Our current work is directed toward the extension of this approach to the storage of such a function defined on a nondiscrete domain.
References

Aronszajn, N. 1950. Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337-404.
Cybenko, G. 1989. Approximation by superposition of a sigmoidal function. CSRD Report No. 856, University of Illinois, Urbana, IL. Also published in Math. Control, Signals, and Systems, 1989.
Davis, P. J. 1975. Interpolation and Approximation. Dover, New York.
de Figueiredo, R. J. P. 1990. An optimal matching score-net for pattern classification. Proc. IJCNN, San Diego, CA, pp. 909-916.
de Figueiredo, R. J. P., and Dwyer, T. A. W. 1980. A best approximation framework and implementation for simulation of large-scale non-linear systems. IEEE Trans. Circuits Syst. 27, 1005-1014.
Kreyszig, E. 1978. Introductory Functional Analysis with Applications, pp. 157-159. Wiley, New York.
Stinchcombe, M., and White, H. 1989. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. Proc. International Joint Conference on Neural Networks (IJCNN), Washington, DC, pp. 613-617.
Stubberud, A. R., and Wabgaonkar, H. 1990. Synthesis of discrete associative memories by multivariate interpolation. Tech. Rep., ECE Dept., University of California, Irvine, CA.
Wabgaonkar, H., and Stubberud, A. 1990. Approximation and estimation techniques for neural networks. IEEE 1990 Conference on Decision and Control, Honolulu, December.
Yao, K. 1967. Applications of reproducing kernel Hilbert spaces: Bandlimited signal models. Inform. Control 11, 429-444.
Received 20 December 1990; accepted 19 March 1992.
Communicated by Geoffrey Hinton
Local Learning Algorithms

Leon Bottou
Vladimir Vapnik
AT&T Bell Laboratories, Holmdel, NJ 07733 USA
Very rarely are training data evenly distributed in the input space. Local learning algorithms attempt to locally adjust the capacity of the training system to the properties of the training set in each area of the input space. The family of local learning algorithms contains known methods, like the k-nearest-neighbors method (kNN) or radial basis function networks (RBF), as well as new algorithms. A single analysis models some aspects of these algorithms. In particular, it suggests that neither kNN nor RBF, nor nonlocal classifiers, achieve the best compromise between locality and capacity. A careful control of these parameters in a simple local learning algorithm has provided a performance breakthrough for an optical character recognition problem. Both the error rate and the rejection performance have been significantly improved.

1 Introduction
Here is a simple local algorithm: for each testing pattern, (1) select the few training examples located in the vicinity of the testing pattern, (2) train a neural network with only these few examples, and (3) apply the resulting network to the testing pattern.

Such an algorithm looks both slow and stupid; indeed, only a small part of the available training examples is used to train the network. Empirical evidence, however, defeats this analysis: with proper settings, this simple algorithm significantly improves the performance of our best optical character recognition networks. A few years ago, V. Vapnik devised a theoretical analysis for such local algorithms, briefly discussed in Vapnik (1992). This analysis introduces a new component, named locality, into the well-known trade-off between the capacity of the learning system and the number of available examples. This paper attempts to explain, and to demonstrate, that such an algorithm might be very efficient for certain tasks, and that its underlying ideas might be used with profit. The voluminous equations of the theoretical analysis, however, will not be discussed in this paper; their complexity, mostly related to the imperfection of current generalization theories, would introduce unnecessary noise into this discussion.

Neural Computation 4, 888-900 (1992)
© 1992 Massachusetts Institute of Technology
In Section 2, we show that handling rejections in some pattern recognition problems requires different properties of a learning device in different areas of the pattern space. In Section 3, we present the idea of a local learning algorithm, discuss related approaches, and discuss the impact of the locality parameter on the generalization performance. In Section 4, we demonstrate the effectiveness of a local learning algorithm on a real-size optical character recognition task.

2 Rejection
An ideal isolated character recognition system always assigns the right symbolic class to a character image. But a real recognizer might commit an error or perform a rejection. Errors are very expensive to correct: a zipcode recognition system, for instance, might erroneously send a parcel to the other end of the world. Therefore, the system should reject a pattern whenever the classification cannot be achieved with enough confidence. Having such a pattern processed by a human being is usually less expensive than fixing an error. Selecting a proper confidence threshold reduces both the cost of handling the rejected patterns and the cost of correcting the remaining errors. The quality of such a pattern recognition system is measured by its rejection curve (cf. Fig. 4). This curve displays the possible compromises between the number of rejected patterns and the number of remaining errors. Two very different situations reduce the classification confidence and might cause a rejection (cf. Fig. 1).

- Patterns might well be ambiguous. For instance, certain people write their "1" like other people write their "7." This cause of rejection is inherent to the problem. Ambiguities arise because important information, like contextual knowledge of the writing style, is not provided as input to the system. Knowing exactly the probability distribution of the patterns and classes would not eliminate such rejections: a pattern would still be rejected if its most probable class does not win by a sufficient margin.

- Patterns might be unrelated to the training data used for defining the classifier. For instance, many atypical writing styles are not represented in the training database. Low-probability areas of the pattern space are poorly represented in the training set; the decision boundary of our classifier in such areas is a mere side effect of the training algorithm. These boundaries are just irrelevant. This second cause of rejection is a direct consequence of the finite nature of the training set. Knowing exactly the probability distribution of the patterns and classes would reveal the exact Bayesian decision boundaries everywhere.
[Figure 1 panels: "Rejection by ambiguity: the class of this pattern cannot be determined with enough confidence." "Rejection by lack of data: we do not have enough examples to find a decision boundary in that area with enough confidence."]

Figure 1: This is a piece of an imaginary pattern space. Gray and black circles are examples of two classes. Thin lines are the actual Bayesian decision boundary between these two classes. Both crosses represent rejected patterns.

This latter cause of rejection has rarely been studied in the literature.¹ Its mere definition involves nonasymptotic statistics closely related to the generalization phenomenon. A high-capacity learning system is able to model accurately and with high confidence the parts of the decision boundary that are well described by the training examples. In these areas, both rejections and misclassifications are rare. The same system, however, produces unreliable high-confidence decision boundaries in the poorly sampled areas of the pattern space: rejections are rare, but misclassifications are frequent. Alternatively, a low-capacity learning system builds low-confidence boundaries in the poorly sampled areas of the pattern space. This system rejects more atypical patterns, but reduces the number of misclassifications. Unfortunately, such a device performs poorly in the well-sampled areas: unable to take profit of the abundant data, it builds poor decision boundaries and rejects almost everything, because everything looks too ambiguous.

¹A Bayesian approach has been suggested in Denker and Le Cun (1991) for estimating error bars on the outputs of a learning system. This useful information affects the interpretation of the outputs, and might improve the rejection performance, as suggested by our reviewer. This method could as well improve the rejection of local algorithms.
In fact, different properties of the learning algorithm are required in different areas of the input space. In other words, the "local capacity" of a learning device should match the density of training examples.

3 Local Learning Algorithms
It is now generally admitted that generalization performance is affected by a global trade-off between the number of training examples and the capacity of the learning system. Various parameters monotonically control the capacity of a learning system (Guyon et al. 1992), including architectural parameters (e.g., the number of hidden units), preprocessing parameters (e.g., the amount of smoothing), or regularization parameters (e.g., the weight decay). The best generalization is achieved for some optimal values of these capacity control parameters, which depend on the size of the training set. This fact holds for rejection as well as for raw performance, in the case of pattern recognition, regression, or density estimation tasks. Whenever the distribution of patterns in the input space is uneven, a proper local adjustment of the capacity can significantly improve the overall performance. Such a local adjustment requires the introduction of capacity control parameters whose impact is limited to individual regions of the pattern space. Here are two ways to introduce such parameters:

- Our experiment (cf. Section 4) illustrates the first solution. For each testing pattern, we train a learning system with the training examples located in a small neighborhood around the testing pattern. Then we apply this trained system to the testing pattern itself. The parameters of the locally trained system de facto affect only the capacity of the global system in the small neighborhood defined around each testing pattern. We shall show below that the k-nearest-neighbors (kNN) algorithm is just a particular case of this approach. Like kNN, such systems are very slow: the recognition speed is penalized by the selection of the closest training patterns and by the execution of the training algorithm of the local learning system.

- In the second solution, the structure of the learning device ensures that each parameter affects the capacity of the system in a small neighborhood only. For example, we might use a separate weight decay per radial basis function (RBF) unit in an RBF network. Each weight decay parameter affects the capacity of the network only locally. Similarly, architectural choices in a modular network (Jacobs et al. 1991) have a local impact on the capacity of the global system. Since such a system is trained at once, the recognition time is not affected by the local nature of the learning procedure.
In the next subsections, we describe a general statement of local learning algorithms, discuss related approaches, and explain how the locality parameter affects the generalization performance.

3.1 Weighted Cost Function. We present here a general statement of local learning algorithms. Let us define J[y, f_w(x)] as the loss incurred when the network gives the answer f_w(x) for the input vector x when the actual answer is y. The capacity control parameters γ directly or indirectly define a subset W_γ of the weight space (Guyon et al. 1992). A nonlocal algorithm searches this subset for the weight vector w*(γ) that minimizes the empirical average of the loss over a training set (x_1, y_1), ..., (x_l, y_l):

w*(γ) = argmin_{w ∈ W_γ} (1/l) Σ_{i=1}^{l} J[y_i, f_w(x_i)]  (3.1)
For each neighborhood defined around a point x_0, a local algorithm searches for a weight vector w*(x_0, b, γ) that minimizes a weighted empirical average of the loss over the training set:

w*(x_0, b, γ) = argmin_{w ∈ W_γ} (1/l) Σ_{i=1}^{l} K(x_i - x_0, b) J[y_i, f_w(x_i)]  (3.2)
The weighting coefficients are defined by a kernel K(x - x_0, b) of width b centered on the point x_0. Various kernels might be used, including square kernels and smooth kernels (cf. Fig. 2).
Figure 2: A square kernel selects only those examples located in a specific neighborhood. A smooth kernel gives a different weight to all examples, according to their position with respect to the kernel. The locality parameter b measures the "size" of the neighborhood.
Since a separate minimization is performed in each neighborhood, the capacity control parameters γ and the kernel width b can be adjusted separately for each neighborhood.

3.2 Related Approaches. This formulation allows for many variations concerning the class of functions f_w(x), the number of neighborhoods, the shape of the kernels K(x - x_0, b), the scaling laws for the kernel width b, or the parameters γ. Selecting the class of functions that are constant with respect to the input vector x, and using a quadratic loss J(y, ȳ) = (y - ȳ)², leads to several popular algorithms, like the kNN method or RBF networks. In each specific neighborhood, such algorithms try to find a constant approximation ȳ* of the desired output:
ȳ* = argmin_ȳ (1/l) Σ_{i=1}^{l} K(x_i - x_0, b) (y_i - ȳ)²  (3.3)
For instance, consider a pattern recognition problem. If the pattern x_i belongs to the nth class, the nth coefficient of the corresponding desired output y_i is equal to 1; the other coefficients are equal to 0.
- For each testing pattern x_0, we consider a square kernel whose width is adjusted to contain exactly k examples. The optimal ȳ* of 3.3 is then the mean of the desired outputs of the k closest patterns. Its highest coefficient corresponds to the most represented class among the k closest patterns to x_0. This is the k-nearest-neighbors (kNN) algorithm.

- If we use a smooth kernel instead of a square kernel, minimizing 3.3 for each testing pattern x_0 computes estimates of the posterior probabilities of the classes. This is the Parzen windows algorithm.

- We now consider R fixed neighborhoods, defined by the centers x_r* and the standard deviations b_r of their gaussian kernels. Minimizing 3.3 in each neighborhood computes the weighted average ȳ_r* of the desired values of the training examples. To evaluate the output y_global(x) of the complete system for an input pattern x, we merge these weighted averages according to the values of the R kernels on x:

y_global(x) = Σ_{r=1}^{R} ȳ_r* K(x - x_r*, b_r) / Σ_{r=1}^{R} K(x - x_r*, b_r)  (3.4)

This is a radial basis function (RBF) network (Broomhead and Lowe 1988; Moody and Darken 1989). A small numpy sketch of these special cases follows.
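The following sketch makes the first two special cases concrete; the toy data and kernel widths are our assumptions:

```python
import numpy as np

def local_constant(x0, X, Y, kernel):
    # Minimizer of (1/l) sum_i K(x_i - x0, b)(y_i - ybar)^2 is the weighted mean.
    w = np.array([kernel(xi - x0) for xi in X])
    return (w[:, None] * Y).sum(0) / w.sum()

def knn_kernel(k, X, x0):
    # Square kernel whose width is adjusted to contain exactly k examples.
    r = np.sort(np.linalg.norm(X - x0, axis=1))[k - 1]
    return lambda u: 1.0 * (np.linalg.norm(u) <= r)

def parzen_kernel(b):
    return lambda u: np.exp(-np.linalg.norm(u) ** 2 / (2 * b ** 2))

# Tiny 2-class toy set; the rows of Y are one-hot desired outputs as above.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
x0 = np.array([0.9, 0.2])
print(local_constant(x0, X, Y, knn_kernel(3, X, x0)))  # kNN class scores
print(local_constant(x0, X, Y, parzen_kernel(0.5)))    # Parzen posterior estimates
```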
3.3 Locality and Capacity. Theoretical methods developed for nonlocal algorithms apply to each local optimization. In particular, the best values of the capacity control parameters (Guyon et al. 1992) depend on the number of training examples. In the context of a local algorithm, however, the effective number of training examples is modulated by the width b of the kernels. For instance, a square kernel selects from the training set a subset whose cardinality depends on the local density of the training set and on the kernel width b. The classical trade-off between capacity and number of examples must be reinterpreted as a trade-off between capacity and locality: if we increase the locality by reducing b, we implicitly reduce the effective number of training examples available for training the local system. kNN and RBF networks use small kernels and a class of constant functions. There is no reason, however, to believe that the best results are obtained with such a low-capacity device. Conversely, big multilayer networks are nonlocal (b = ∞) but have a high capacity. Modular networks (Jacobs et al. 1991) sit somewhere between these two extremes: the kernel functions are embodied by a "gating network," which selects or combines the outputs of the modules according to the input data and sometimes the outputs of the modules themselves. Another phenomenon makes the situation slightly more complex: the capacity control parameters can be adjusted separately in each neighborhood. This local adjustment is more accurate when the kernel width is small. On the other hand, there is little to adjust in a very low capacity device.
4 Experiments
This section discusses experiments with a simple local learning algorithm on a real-size pattern recognition problem. Comparisons have been carried out (1) with a backpropagation network and (2) with the kNN and Parzen windows algorithms.

- A backpropagation network is a nonlocal algorithm with a comparatively high capacity. Comparison (1) shows that introducing locality and reducing the capacity improves the resulting performance of such a network.

- A kNN classifier is an extremely local algorithm with a very low capacity. Comparison (2) shows that reducing the locality and increasing the capacity again improves the resulting performance.
4.1 A Simple Local Learning Algorithm. We have implemented a simple local algorithm:
For each testing pattern x_0, a linear classifier is trained on the k closest training examples, (x_1, y_1), ..., (x_k, y_k), for the Euclidean distance. This trained network is then applied to the testing pattern x_0. The effective number of examples, k, is much smaller than the number of weights in the linear classifier. Therefore, a strong weight decay γ is required to reduce the capacity of the linear classifier. A weight decay, however, pulls the weights toward some arbitrary origin. For isotropy reasons, the origin of the input space is translated onto the testing pattern x_0, by subtracting x_0 from all the selected training patterns. This also has the advantage of reducing the eigenvalue spread of the hessian matrix. The training procedure computes the explicit minimum of a quadratic cost incorporating a weight decay term; the positive weight decay γ ensures that the required matrix inversion is possible.
Since the new coordinate system is centered on the testing pattern, the output of the network on this testing pattern is equal to the bias vector. The highest output determines which class is selected for pattern x_0. If, however, the difference between the highest output and the second highest output is less than a certain threshold, the pattern is rejected. A simple heuristic rule controls the capacity versus locality trade-off within each neighborhood: it adjusts the locality and leaves the capacity constant. In other words, the same value of the weight decay parameter γ is used in all neighborhoods. The locality is usually controlled by a kernel size b, which should increase when the density of training examples decreases. In fact, selecting the k closest training examples is equivalent to having a square kernel whose size is somewhat adjusted according to the local density of training examples; we just use the same value of k in all neighborhoods. Although this system is extremely inefficient, it implements a wide range of compromises between locality and capacity, according to the values of only two parameters: a locality parameter k and a regularization parameter γ. We have found this quality attractive for a first experiment.
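Here is a numpy sketch of this local step under our reading of it (ridge regression on the k nearest patterns, origin translated onto the test point); for simplicity the decay also regularizes the bias, which the paper does not specify:

```python
import numpy as np

def local_linear_classify(x0, X, Y, k=200, decay=0.01):
    # X: training patterns (N, d); Y: one-hot desired outputs (N, C).
    d = np.linalg.norm(X - x0, axis=1)
    idx = np.argsort(d)[:k]                    # the k closest training examples
    Z = X[idx] - x0                            # translate the origin onto x0
    Z1 = np.hstack([Z, np.ones((k, 1))])       # append a bias input
    # Explicit minimum of ||Z1 W - Y_k||^2 + decay ||W||^2; the positive
    # decay makes the matrix inversion well posed.
    A = Z1.T @ Z1 + decay * np.eye(Z1.shape[1])
    W = np.linalg.solve(A, Z1.T @ Y[idx])
    scores = W[-1]                             # at x0 the input is 0: output = bias row
    order = np.argsort(scores)
    return order[-1], scores[order[-1]] - scores[order[-2]]  # class, rejection margin

rng = np.random.default_rng(0)
Xtr = rng.normal(size=(500, 8))
Ytr = np.eye(10)[rng.integers(10, size=500)]
cls, margin = local_linear_classify(Xtr[0], Xtr, Ytr, k=50, decay=0.01)
print(cls, round(float(margin), 3))            # reject when margin < threshold
```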
896
LPon Bottou and Vladimir Vapnik
0I 2 3 4 5 6 7 89 n....
.
rpys v:
10 clauiflcm'oa lvliu
Figure 3: "LeNet": Layers i, ii, iii, and iv compute 192 features. Layer v performs the classification. In this paper, we replace layer v by a local algorithm. Table 1: Results on an Optical Character Recognition Task.
Raw error Human LeNet kNN Parzen Local
(on segmented digits) zz 2.5% (on segmented digits) 5.1% (on LeNet features) 5.1% (on LeNet features) 4.7% (on LeNet features) 3.3%
Rejection for 1%error n.a. 9.6% n.a. 10.8% 6.2%
of rejected pattern when the rejection threshold is adjusted to allow less than 1% misclassification on the remaining patterns. The 2.5% human performance on the segmented and preprocessed digits provides a reference point (Sackinger and Bromley, personal communication). The nickname "LeNet" designates the network described in Le Cun et al. (1990). This five-layer network performs both the feature extraction and the classification of 16 x 16 images of single handwritten digits. Four successive convolutional layers extract 192 translation invariant features; a last layer of 10 fully connected units performs the classification (cf. Fig. 3 ) . This network has achieved 5.1% error and 9.6% rejection for 1% error.* 'This is slightly worse than the 4.6%raw error and 9% rejection for 1%error reported in Le Cun et nl. (1990). This is due (1) to a slightly different definition of the rejection performance (1%error on the remaining patterns vs. 1% error total), and (2) to a more robust preprocessing code.
Local Learning Algorithms
897
3j5 4 4
t \\
..
1
0 0
2I 2
4,\-%-,6 4 6
8 8
10 12 14 16 18 20 22 24 10 12 14 16 18 20 22 24
Rejection (96)
Figure 4: Punting curve. This curve shows the error rates at various rejection rates for plain LeNet (dashed curve) and our simple local algorithm operating on the features computed by LeNet (plain curve).
The 192 features computed by LeNet are used as inputs to three other pattern recognition systems. These systems can be viewed as replacements for the last layer of LeNet. Therefore, results can be directly compared. The best kNN performance, 5.1%raw error, was obtained by using the three closest neighbors only. These few neighbors allow no meaningful rejection strategy. The Parzen system is similar to k".We just replace the square kernel by a gaussian kernel, whose standard deviation is half the distance of the fourth closest pattern. Several variations have been tested; we report the best results only: 4.7% raw error and 10.8% rejection for 1% error. Finally, we have tested the simple local learning algorithm described above, using the 200 closest patterns and a weight decay of 0.01. This weight decay is enough to train our 1920 weights using 200 patterns. At this time, both the 3.3% raw error and the 6.2% rejection rate for 1%error were the best performances reached on this data set. A derivation reported in the appendix shows that this performance improvement is statistically significant; Figure 4 compares the rejection curve of the local system with the rejection curve of "LeNet." The local system performs better for all values of the threshold. At 17% rejection, the single remaining error is a mislabeled pattern.
Figure 5: Evolution of the raw error for the local system around the best values of k and of the weight decay γ. (In fact, the decay axis displays the product γk.)
With a proper adjustment of its locality and capacity parameters, this simple algorithm outperforms both (1) a nonlocal algorithm (i.e., the last layer of LeNet), and (2) two extremely local algorithms (i.e., kNN or Parzen windows). Figure 5 shows how the raw error of the local system changes around the best values of k and of the weight decay γ. Finally, no significant performance changes have been obtained by using smooth kernels or fancy heuristics for controlling the kernel width and the weight decay.
4.3 Recognition Speed. This simple system, however, spends 50 seconds recognizing a single digit. Training a network for each testing pattern is certainly not a practical approach to the optical character recognition problem. In Section 3, however, we have presented two solutions for building local learning algorithms. We have deliberately chosen to implement the simpler one, which leads to very slow recognizers. We could just as well design systems based on our second solution, i.e., using a network structure that allows local control of the capacity of the system. Such a system would be slightly more complex to handle, but would have a much shorter recognition time.
5 Conclusion

No particular architectural change is responsible for this performance breakthrough. In fact, this system is linear, and replaces a linear decision layer in LeNet. The change concerns the training procedure. The performance improvement simply results from a better control of the basic trade-offs involved in the learning process. Although much remains to be understood about learning, we have some practical and theoretical knowledge of these basic trade-offs. Understanding how these trade-offs affect a specific learning problem often allows us to profit from their properties in practical applications. Local learning algorithms are just one successful example of this strategy.
Appendix: Confidence Intervals

This section presents the derivations establishing the significance of the results presented in this paper. We first derive a nonasymptotic formula for computing confidence when comparing two classifiers on the same test set of N independent examples. Each classifier defines certain decision boundaries in the pattern space. It is enough to compare our classifiers on only those cases where one classifier is right and the other one is wrong. Let us call p1 and p2 the conditional probabilities of error of each classifier, given that exactly one classifier gives a wrong answer. Similarly, let us define n1 and n2 as the numbers of errors that each classifier makes and which the other classifier classifies correctly, and n12 as the number of common errors. According to the large number inequality (2.7) in Hoeffding (1963), we obtain inequality (A.1).
Furthermore, if we name ν1 and ν2 the measured error rates on our test set, we have

n1 - n2 = (n1 + n12) - (n2 + n12) = N(ν1 - ν2)    (A.2)
By solving for t when the right-hand side of inequality (A.1) is 1 - η, we can compute the minimum difference ν1 - ν2 which ensures that p1 - p2 is larger than 0 with probability 1 - η. Since comparing p1 and p2 is enough to decide which classifier is better, the following result is valid: If
then classifier 2 is better than classifier 1 with probability 1 - η. In our case, all systems achieve less than 5.1% error on a test set of size N = 2007. The quantity n1 + n2 is thus smaller than 10.2% of N,
probably by a large margin. If we choose η = 5%, we get a minimal significant error difference of 1.2%. Measuring the actual value of n1 + n2 would further reduce this margin. The significance of the results presented in this paper, however, is established without such a refinement.
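Numerically, the quoted 1.2% can be reproduced as follows. Since inequality (A.1) did not survive reproduction here, the sketch below assumes a Hoeffding-style threshold of the form ν1 - ν2 > sqrt((n1 + n2) ln(1/η)) / N; the exact form of the threshold is our reading, not a formula taken from the paper.

```python
import math

def min_significant_difference(N, max_disagreements, eta=0.05):
    """Smallest measured error difference v1 - v2 that certifies p1 > p2
    with probability 1 - eta, under the assumed Hoeffding-style threshold
    (our reading of the unreproduced inequality A.1)."""
    return math.sqrt(max_disagreements * math.log(1.0 / eta)) / N

N = 2007
# All systems err on less than 5.1% of the test set, so the number of
# disagreement errors n1 + n2 is below 10.2% of N.
print(min_significant_difference(N, 0.102 * N))   # ~0.012, i.e., 1.2%
```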
Acknowledgments

We wish to thank Larry Jackel's group at Bell Labs for their continuous support and useful comments. We are especially grateful to Yann Le Cun for providing networks and databases, and to E. Sackinger and J. Bromley for providing the human performance results.
References

Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Denker, J. S., and Le Cun, Y. 1991. Transforming neural-net output levels to probability distributions. In Advances in Neural Information Processing Systems 3 (NIPS*90), Lippmann, R., Moody, J., and Touretzky, D., eds. Morgan Kaufmann, Denver.
Guyon, I., Vapnik, V. N., Boser, B. E., Bottou, L., and Solla, S. A. 1992. Structural risk minimization for character recognition. In Advances in Neural Information Processing Systems, Vol. 4. Morgan Kaufmann, Denver.
Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. J. Am. Statist. Assoc. 58, 13-30.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3(1), 79-87.
Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1990. Handwritten digit recognition with a backpropagation network. In Advances in Neural Information Processing Systems, D. Touretzky, ed., Vol. 2. Morgan Kaufmann, Denver.
Moody, J., and Darken, C. J. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1(2), 281-294.
Vapnik, V. N. 1992. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, Vol. 4. Morgan Kaufmann, Denver. To appear.
Received 19 February 1992; accepted 3 April 1992.
Communicated by Dana Ballard
Object Discrimination Based on Depth-from-Occlusion

Leif H. Finkel
Paul Sajda
Department of Bioengineering and Institute of Neurological Sciences, University of Pennsylvania, Philadelphia, PA 19104-6392 USA

We present a model of how objects can be visually discriminated based on the extraction of depth-from-occlusion. Object discrimination requires consideration of both the binding problem and the problem of segmentation. We propose that the visual system binds contours and surfaces by identifying "proto-objects"-compact regions bounded by contours. Proto-objects can then be linked into larger structures. The model is simulated by a system of interconnected neural networks. The networks have biologically motivated architectures and utilize a distributed representation of depth. We present simulations that demonstrate three robust psychophysical properties of the system. The networks are able to stratify multiple occluding objects in a complex scene into separate depth planes. They bind the contours and surfaces of occluded objects (for example, if a tree branch partially occludes the moon, the two "half-moons" are bound into a single object). Finally, the model accounts for human perceptions of illusory contour stimuli.

1 Introduction
In order to discriminate objects in the visual world, the nervous system must solve two fundamental problems: binding and segmentation. The binding problem (Barlow 1981) addresses how the attributes of an object-shape, color, motion, depth-are linked to create an individual object. Segmentation deals with the converse problem of how separate objects are distinguished. These two problems have been studied from the perspectives of both computational neuroscience (Marr 1982; Grossberg and Mingolla 1985; T. Poggio et al. 1988; Finkel and Edelman 1989) and machine vision (Guzman 1968; Rosenfeld 1988; Aloimonos and Shulman 1989; Fisher 1989). However, previous studies have not addressed what we consider to be the central issue: how does the visual system define an object, i.e., what constitutes a "thing." Object discrimination occurs at an intermediate stage of the transformation between two-dimensional (2D) image intensity values and visual recognition, and in general, depends on cues from multiple visual modalities. To simplify the problem, we restrict ourselves to discrimi-
Neural Computation 4, 901-921 (1992)
@ 1992 Massachusetts Institute of Technology
nation based solely on occlusion relationships. In a typical visual scene, multiple objects may occlude one another. When this occurs, it creates a perceptual dilemma: to which of the two overlapping surfaces does the common border belong? If the border is, in fact, an occlusion border, then it belongs to the occluding object. This identification results in a stratification of the two objects in depth and a de facto discrimination of the objects. Consider the case of a tree branch crossing the face of the moon. We perceive the branch as closer and the moon more distant, but in addition, the two "half-moons" are perceptually linked into one object. The visual system supplies a virtual representation of the occluded contours and surfaces in a process Kanizsa (1979) has called "amodal completion."

With this example in mind, we propose that the visual system identifies "proto-objects" and determines which proto-objects, if any, should be linked into objects. For present purposes, a proto-object is defined as a compact region surrounded by a closed, piecewise continuous contour and located at a certain distance from the viewer. The contour can be closed on itself, or more commonly, it can be closed by termination on other contours. We will demonstrate how a system of interconnected, physiologically based neural networks can identify proto-objects, link them into objects, and stratify the objects in depth. The networks operate, largely in parallel, to carry out the following interdependent processes (a skeleton of this processing loop is sketched in code after the list):

• discriminate edges
• segment and bind contours
• identify proto-objects (i.e., bind contours and surfaces)
• identify possible occlusion boundaries
• stratify occluding objects into different depth planes
• attempt to link proto-objects into objects
• influence earlier steps (e.g., contour binding) by results of later steps (e.g., object linkage).
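Read as pseudocode, the loop implied by this list might look as follows. Every stage function here is a named placeholder of our own (trivial stubs so the skeleton runs), not one of the model's networks; only the control flow, with linking feedback re-entering contour binding on the next cycle, is meant to reflect the scheme.

```python
# Trivial stand-ins so the skeleton below actually runs; in the model each
# stage is one or more 64 x 64 neural maps, not a Python function.
def discriminate_edges(image):              return image
def bind_contours(edges, links):            return {"edges": edges, "links": list(links)}
def direction_of_figure(bound):             return {}
def find_tag_junctions(bound):              return []
def stratify_depth(bound, junctions, dirs): return {}
def link_proto_objects(bound, dirs, depth): return []

def discriminate_objects(image, n_cycles=5):
    """Control flow as we read it: largely parallel stages, iterated for a
    few cycles so that reentrant feedback (the object links) can retag
    contours bound on earlier cycles."""
    edges = discriminate_edges(image)
    links = []                                  # self-generated links (reentry)
    for _ in range(n_cycles):
        bound = bind_contours(edges, links)     # feedback re-enters here
        dirs = direction_of_figure(bound)       # inside vs. outside
        junctions = find_tag_junctions(bound)   # occlusion cues
        depth = stratify_depth(bound, junctions, dirs)
        links = link_proto_objects(bound, dirs, depth)
    return bound, depth

tags, depth = discriminate_objects(image=[[0] * 64 for _ in range(64)])
```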
The constructed networks implement these processes using a relatively small number of neural mechanisms (such as detecting curvature, and determining which surface is inside a closed contour). A few of the mechanisms used are similar to those of previous proposals (Grossberg and Mingolla 1985; Finkel and Edelman 1989; Fisher 1989). But our particular choice of mechanisms is constrained by two considerations. First, we utilize a distributed representation of depth-this is based on the example of how disparity is represented in the visual cortex (G. Poggio et al. 1988; Lehky and Sejnowski 1990). The relative depth of a particular object is represented by the relative activation of corresponding units in a foreground and background map. Second, as indicated above, we make
extensive use of feedback (reentrant) connections from higher level networks to those at lower levels-this is particularly important in linking proto-objects. For example, once a higher level network has determined an occlusion relationship it can modify the way in which an earlier network binds contours to surfaces. Any model of visual occlusion must be able to explain the perception of illusory (subjective) contours, since these illusions arise from artificially arranged cues to occlusion (Gregory 1972). The proposed model can account for the majority of such illusions. In fact, the ability to link contours in the foreground and background corresponds, respectively, to the processes of modal and amodal completion hypothesized by Kanizsa (1979). The present proposal differs from previous neural models of illusory contour generation (Ullman 1977; Grossberg and Mingolla 1985; von der Heydt et al. 1989; Finkel and Edelman 1989) in that it generates illusory objects-not just the contours. The difference is critical: a network which generates responses to the three sides of the Kanizsa triangle, for example, is not representing a triangle (the object) per se. To represent the triangle it is necessary to link these three contours into a single entity, to know which side of the contour is the inside, to represent the surface of the triangle, to know something about the properties of the surface (its depth, color, texture, etc.), and finally to bind all these attributes into a whole. This is clearly a much more difficult problem. We will describe, however, a simple model for how such a process might be carried out by a set of interconnected neural networks, and present the results of simulations that test the ability of the system on a range of normal and illusory scenes.
2 Implementation
Simulations of the model were conducted using the NEXUS Neural Simulator (Sajda and Finkel 1992). NEXUS is an interactive simulator designed for modeling multiple interconnected neural maps. The simulator allows considerable flexibility in specifying neuronal properties and neural architectures. The present simulations feature an interconnected system composed of 10 different network architectures, each of which contains one or more topographically organized arrays of 64 x 64 units. Two types of neuronal units are used. Standard neuronal units carry out a linear weighted summation of their excitatory and inhibitory inputs, and outputs are determined by a sigmoidal function between voltage and firing rate. NEXUS also allows the use of more complex units called PGN (programmable generalized neural) units that execute arbitrary functions or algorithms. A single PGN unit can emulate the function of a small circuit or assembly of standard units. PGN units are particularly useful in situations in which an intensive computation is being performed but the anatomical and physiological details of how the operation is performed in vivo are unknown. Alterna-
tively, PGN units can be used to carry out functions in a time-efficient manner; for example, to implement a one-step winner-take-all algorithm. The PGN units used in the present simulations can all be replaced with circuits composed of standard neuronal units, but this incurs a dramatic increase in processing time and memory allocation with minimal changes in functional behavior at the system level. No learning is involved in the network dynamics. The model is intended to correspond to visual processing during a brief interval (less than 200 msec following stimulus presentation), and the interpretation of even complex scenes requires only a few cycles of network activity. The details of network construction will be described elsewhere; we will focus here on the processes performed and the theoretical issues behind the mechanisms.

Figure 1: Major processing stages in the model. Each process is carried out by one or more networks. Following early visual stages, information flows through two largely parallel pathways: one concerned with identifying and linking occlusion boundaries (left side) and another concerned with stratifying objects in depth (right side). Networks are multiply interconnected; note the presence of the two reentrant feedback pathways.

3 Construction of the Model
The model consists of a number of stages as indicated in Figure 1. The first stage of early visual processing involves networks specialized for the
detection of edges, line orientation, and line terminations (endstopping). As Ramachandran (1987) observed, the visual system must distinguish several different types of edges: we are concerned here with the distinction between edges due to surface discontinuities (transitions between different surfaces) and those due to surface markings (textures, stray lines, etc.). Only the former can be occlusion boundaries. The visual system utilizes several modalities to classify types of edges; we restrict ourselves to a single process carried out by the second processing stage, a network that determines which segments belong to which contours and whether the contours are closed. When two contours cross each other, forming an "X" junction, there are several possible perceptual interpretations of which arms of the "X" should be joined. Our networks carry out the simple rule that discontinuities should be minimized, i.e., lines and curves should continue as straight (or with as nearly the same curvature) as possible. Similar assumptions underlie previous models (Ullman 1977), and this notion is in accord with psychophysical findings that discontinuities contain more information than continuous segments (Attneave 1954; Resnikoff 1989). We are thus minimizing the amount of self-generated information.

We employ a simple sequential process to determine whether a contour is closed: each unit on a closed contour requires that at least two of its nearest neighboring units also be on the contour. It is computationally difficult to determine closure in parallel. We speculate that, in vivo, the process is carried out by a combination of endstopped units and large receptive-field cells arranged in an architecture similar to that described in Area 17 (Rockland and Lund 1982; Mitchison and Crick 1982; Gilbert and Wiesel 1989). Once closure is determined, it is computationally efficient for the units involved to be identified with a "tag." Several of the higher level processes discussed below require that units responding to the same contour be distinguishable from those responding to different contours. There are several possible physiological mechanisms that could subserve such a tag; one possible mechanism is phase-locked firing (Gray and Singer 1989; Eckhorn et al. 1988). We have implemented this contour binding tag through the use of PGN units (Section 2), which are capable of representing several distinct tags. It must be emphasized, however, that the model is compatible with a number of possible physiological mechanisms.

Closed contours are a necessary condition to identify a proto-object, but sufficiency requires two additional components. As shown in Figure 1, the remaining determinations are carried out in parallel. One stage is concerned with determining on which side of the contour the figure lies, i.e., distinguishing inside from outside. The problem can be alternatively posed as determining which surface "owns" the contour (Koffka 1935; Nakayama and Shimojo 1990). This is a nontrivial problem that, in general, requires global information about the figure. The classic example is the spiral (Minsky and Papert 1969; Sejnowski and Hinton 1987), in which it is impossible to determine whether a given point lies inside or outside on the basis of local information alone.
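The closure test described above is easy to state concretely. The sketch below checks the stated necessary condition, that every unit on the contour has at least two contour neighbors; the 8-neighborhood and the pixel-set representation are our own assumptions, and the speculative in vivo circuit is not modeled.

```python
def contour_is_closed(pixels):
    """Necessary condition for closure used in the text: every unit on the
    contour must have at least two of its nearest neighbors on the contour.
    `pixels` is a set of (row, col) positions; the 8-neighborhood is our
    choice, since the paper does not specify the neighborhood."""
    for (r, c) in pixels:
        neighbors = sum((r + dr, c + dc) in pixels
                        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                        if (dr, dc) != (0, 0))
        if neighbors < 2:
            return False          # an endpoint: the contour is open
    return True

assert contour_is_closed({(0, 0), (0, 1), (1, 1), (1, 0)})   # closed ring
assert not contour_is_closed({(0, 0), (0, 1), (0, 2)})       # open segment
```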
Figure 2: Neural circuit for determining direction of figure (inside vs. outside). Hypothetical visual stimulus consists of two closed contours (bold curves). The central unit of the 3 x 3 array (shown below) determines the local orientation of the contour. Surrounding units represent possible directions (indicated by arrows) of the inside of the figure relative to the contour. All surrounding units are inhibited (black circles) except for the two units located perpendicular to the local orientation of the contour. Units receive inputs from the contour binding map via dendrites that spread out in a stellate configuration, as indicated by clustered arrows (dendrites extend over long distances in the map). Units inside the figure will receive more inputs than those located outside the figure. The two uninhibited units compete in a winner-take-all interaction. Note that inputs from separate objects are not confused due to the tags generated in the contour binding map.

The mechanism we employ, as shown in Figure 2, is based on the following simple observation. Suppose a unit projects its dendrites in a stellate configuration and that the dendrites are activated by units responding to a contour. Then units located inside a closed contour will receive more activation than units located outside the contour. A winner-take-all interaction between the two units will
determine which is more strongly activated, and hence which is inside the figure. As shown in Figure 2, it is advantageous to limit this competition to the two units that are located at positions perpendicular to the local orientation of the contour. As will be shown below (see Figs. 5-7), this network is quite efficient at locating the interior of figures. It also demonstrates deficiencies similar to those of human perception; for example, it cannot distinguish the inside from the outside of a spiral. The mechanism depends on the contour binding carried out above. Each unit only considers inputs with the appropriate tag; in this way, inputs from separate contours in the scene are not confused.
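A minimal sketch of this winner-take-all scheme follows, assuming a disc-shaped dendritic field and a one-pixel offset for the two candidate units (neither the field shape nor the offset is specified in the paper):

```python
import numpy as np

def inside_direction(contour, point, normal, reach=10):
    """Decide which side of `point` is the figure's inside, following the
    stellate-dendrite scheme of Figure 2. Each of the two candidate units,
    offset along +/- the contour normal, sums the contour pixels (same tag)
    falling within its dendritic field; the winner-take-all step simply
    compares the two sums."""
    contour = np.asarray(contour, dtype=float)
    candidates = [np.asarray(point) + s * np.asarray(normal) for s in (+1, -1)]
    support = [np.sum(np.linalg.norm(contour - c, axis=1) <= reach)
               for c in candidates]
    return candidates[int(np.argmax(support))]   # unit receiving more input

# Square contour: at the midpoint of its left edge, the inside lies to the
# right (+x), and the unit on that side collects more contour input.
square = [(x, 0) for x in range(11)] + [(x, 10) for x in range(11)] + \
         [(0, y) for y in range(11)] + [(10, y) for y in range(11)]
print(inside_direction(square, point=(0, 5), normal=(1, 0)))  # -> [1 5]
```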
Figure 3: Primary cues for occlusion. Tag junctions (shown in the inset) signal a local discontinuity between occluding and occluded contours. Concave regions and surrounded contours suggest occlusion, but are not as reliable indicators as tag junctions. Additional cues such as accretion/deletion of texture (not considered here) are used in vivo.

Identification of a proto-object also requires that the relative depth of the surface be determined. This is carried out chiefly through the use of tag junctions. As shown in Figure 3, a tag junction is formed by the termination of an occluded boundary on an occluding boundary. Tag junctions generally correspond to T-junctions in the image; however, they arise from discontinuities in the binding tags and are therefore associated with surface discontinuities as well. Note that tag junctions are identified at an intermediate stage in the system (see Fig. 1) and are not
constructed directly from end-stopped units in early vision. This accords with the lack of physiological evidence for "junction" detectors in striate cortex. In this model, tag junctions serve as the major determinant of relative depth. At such junctions, there is a change in the binding (or ownership) of contours, and it is this change which produces the discontinuity in perceived depth. Depth is represented by the relative level of activity in two topographic maps (called foreground and background). The closest object maximally activates foreground units and minimally activates background units; the most distant object has the reverse values, and objects located at intermediate depths display intermediate values. The initial state of the two maps is such that all closed contours lie in the background plane. Depth values are then modified at tag junctions: contours corresponding to the head of the "T" are pushed toward the foreground. Since multiple objects can overlap, a contour can be both occluding and occluded; therefore, the relative depth of a contour is determined in a type of push-pull process in which proto-objects are shuffled in depth. The contour binding tag is critical in this process in that all units with the same tag are pushed forward or backward together. (In the more general case of nonplanar objects, the alteration of depth values would depend on position along the contour.)

Tag junctions arise in cases of partial occlusion; however, in some instances, a smaller object may actually lie directly in front of a larger object. In this case, which we call "surround" occlusion, the contour of the occluded object surrounds that of the occluding object. As shown in Figure 1, a separate process determines whether such a surround occlusion is present and, in the same manner as tag junctions, leads to a change in the representation of relative depth. The network mechanism for detecting surround occlusion is almost identical to that discussed above for determining the direction of figure (see Fig. 2). Note that a similar configuration of two concentric contours arises in the case of a "hole." The model is currently being extended to deal with such non-simply-connected objects.

These processes - contour binding, determining direction of the figure, and determination of relative depth - define the proto-object. The remainder of the model is concerned with linking proto-objects into objects. The first step in this endeavor is to identify occlusion boundaries. Since occlusion boundaries are concave segments of contours, such segments must be detected (particularly, concave segments bounded by tag junctions). Although many machine vision algorithms exist for determining convexity, we have chosen to use a simple, neurally plausible mechanism: at each point of a contour, the direction of figure is compared to the direction of curvature [which is determined using endstopped units (Dobbins et al. 1987)]. In convex regions, the two directions are the same; in concave regions, the two directions are opposed. A simple AND mechanism can therefore identify the concave segments of the contours.
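The push-pull depth process described above can be sketched as follows. Junctions are given as (head, tail) tag pairs; all units sharing a tag move together, and the foreground/background activities at the end mimic the distributed depth code. The additive update rule and step size are our own assumptions; the paper specifies only the push-pull character of the process.

```python
def stratify(tags, junctions, n_iters=5, step=0.25):
    """Push-pull depth stratification sketch. Each junction is a
    (head_tag, tail_tag) pair: the head of the "T" owns the contour and
    occludes the tail. Depth is stored per tag, since every unit sharing
    a tag is pushed forward or backward together."""
    depth = {t: 0.0 for t in tags}            # 0 = background plane
    for _ in range(n_iters):
        for head, tail in junctions:
            depth[head] += step               # push the occluder forward
            depth[tail] -= step               # pull the occluded back
    top = max(abs(v) for v in depth.values()) or 1.0
    # Distributed representation: foreground and background activities
    # vary in opposition, as in the model's two topographic maps.
    return {t: {"foreground": 0.5 + 0.5 * v / top,
                "background": 0.5 - 0.5 * v / top} for t, v in depth.items()}

# Fence occludes horse, horse occludes house: three depth planes emerge.
print(stratify({"fence", "horse", "house"},
               [("fence", "horse"), ("horse", "house")]))
```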
Figure 4: Linking of occluded contours. Three possible perceptual interpretations (below) of an occlusion configuration (above) are shown. Small arrows indicate direction of figure (inside/outside). Collinearity cannot be the sole criterion for linking occluded edges. Consistency in the direction of figure between linked objects rules out perception c.

Once occlusion borders are identified, proto-objects can be linked by trying to extend, complete, or continue occluded segments. Linkage most commonly occurs between proto-objects in the background, i.e., between spatially separated occluded contours. For example, in Figure 3, the occluded contours which terminate at the two tag junctions can be linked to generate a virtual representation of the occluded segment. Since it is impossible to know exactly what the occluded segment looks like, and since it is not actually "perceived," we have chosen not to generate a representation of the occluded segment. Rather, a network link binds together the endpoints of the two tag junctions. In the case where multiple objects are occluded by a single object, the problem of which contours to link can become complex. As shown in Figure 4, one important constraint on this process is that the directions of figure be consistent between the two linked proto-objects. Another condition in which proto-objects can be linked involves the joining of occluding contours, i.e., of proto-objects in the foreground. This phenomenon occurs in our perception of illusory contours, for example, in the Kanizsa triangle (Kanizsa 1979) or when a gray disc is viewed against a background whose luminance changes in a smooth spatial gradient from black to white (Marr 1982; Shapley and Gordon 1987). In this case, a representation of the actual contour is generated. The conditions for linkage are that the two contours must be smoothly joined by a line or curve, and that the direction of figure be consistent (as in the case of occluded contours above).
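The two linkage conditions, smooth continuation and consistent direction of figure, can be expressed as a simple geometric test. The tolerance value and the vector conventions below are our own illustrative choices:

```python
import numpy as np

def can_link(end_a, tangent_a, figure_a, end_b, tangent_b, figure_b,
             max_bend_deg=30.0):
    """Test whether two occluded contour endpoints may be linked. Each
    tangent points from its contour into the gap; each figure vector points
    toward that proto-object's inside. The 30-degree smoothness tolerance
    is a hypothetical threshold of our own."""
    gap = np.subtract(end_b, end_a, dtype=float)
    gap /= np.linalg.norm(gap)

    def bend(t, d):                     # angle between tangent and link
        t = np.asarray(t, float) / np.linalg.norm(t)
        return np.degrees(np.arccos(np.clip(np.dot(t, d), -1.0, 1.0)))

    def crossz(a, b):                   # z-component of the 2D cross product
        return a[0] * b[1] - a[1] * b[0]

    smooth = bend(tangent_a, gap) < max_bend_deg and \
             bend(tangent_b, -gap) < max_bend_deg
    # Both insides must lie on the same side of the linking line, ruling
    # out interpretations like Figure 4c.
    same_side = crossz(gap, figure_a) * crossz(gap, figure_b) > 0
    return smooth and same_side

# Two collinear horizontal edges whose insides both lie below the link:
print(can_link((0, 0), (1, 0), (0, -1), (5, 0), (-1, 0), (0, -1)))  # True
```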
The major difference between these two linking or completion processes is that contours generated in the foreground are perceived while those in the background are not. However, the same mechanisms are used in both cases. We have elected to segregate the foreground and background linking processes into separate networks for computational simplicity; it is possible, however, that in vivo a single population of units carries out both functions. Regardless of the implementation, the interaction between ongoing linking processes in the foreground and background is critical. Since these links are self-generated by the system (they do not exist in the physical world), they must be scrutinized to avoid false conjunctions. The most powerful check on these processes is their mutual consistency: an increased certainty of the occluded contour continuation being correct increases the confidence of the occluding contour continuation, and vice versa. For example, in the case of the Kanizsa triangle, the "pac-man"-like figures can be completed to form complete circles by simply continuing the contour of the pac-man. The relative ease of completing the occluded contours, in turn, favors the construction of the illusory contours, which correspond to the continuations of the occluding contours. In fact, we believe that the interaction between these two processes determines the perceptual vividness of the illusion.

The final steps in the process involve a recurrent feedback (or reentry, Finkel and Edelman 1989) from the networks that generate these links back to earlier stages so that the completed contours can be treated as real objects. Note that the occluded contours feed back to the contour binding stage, not to the line discrimination stage, since in this case the link is virtual, and there is no generated line whose orientation, etc., can be determined. The feedback is particularly important for integrating the outputs of the two parallel paths. For example, once an occluding contour is generated, as in the illusory contours generated in the Kanizsa triangle, it creates a new tag junction (with the circular arc as the "tail" and the illusory contour as the "head" of the "T"). On the next iteration through the system, this tag junction is identified by networks in the other parallel path of the system (see Fig. 1), and is used to stratify the illusory contour in depth.

4 Results of Simulations
4.1 Linking Proto-objects. We present the results of three simulations which illustrate the ability of the system to discriminate objects. Figure 5 shows a visual scene that was presented to the system. The early networks discriminate the edges, lines, terminations, and junctions present in the scene. Figure 5A displays the contour binding tags assigned to different scene elements (on the first and fifth cycle of activity). Each box represents active units with a common tag, different boxes rep-
resent different tags, and the ordering of the boxes is arbitrary. Note that on the first cycle of activity, discontinuous segments of contours are given separate tags. These tags are changed by the fifth cycle as a result of feedback from the linking processes. Figure 5B shows the output of the direction of figure network for a small portion of the input scene (near the horse's head). The direction of the arrows indicates the direction of figure determined by the network. The correct direction of figure is determined in all cases: for the horse's head, and for the horizontal and vertical posts of the fence. Once the direction of figure is identified, occluded contours can be linked (as in Fig. 4), and proto-objects combined into objects. This linkage is what changes the contour binding tags, so that after several cycles (Fig. 5A, right), separate tags are assigned to separate objects-the horse, the gate posts, the house, the sun. The presence of tag junctions (e.g., between the horse's contour and the fence, between the house and the horse's back) is used by the system to force various objects into different depth planes. The results of this process are displayed in Figure 5C, which plots the firing rate (percent of maximum) of units in the foreground network. The system has successfully stratified the fence, horse, house, and sun. The actual depth value determined for each object is somewhat arbitrary, and can vary depending on minor changes in the scene-the system is designed only to achieve the correct relative ordering, not absolute depth. Note that the horizontal and vertical posts of the fence are perceived at different depths-this is because of the tag junctions present between them; in fact, the two surfaces do lie at slightly different depths. In addition, there is no way to determine the relative depth of the two objects in the background, the house and the sun, because they bear no occlusion relationship to each other. Again, this conforms to human perceptions, e.g., the sun and the moon appear about the same distance away. The system thus appears to process occlusion information in a manner similar to human perception.
4.3 Occlusion Capture. The final simulation shows the ability of the system to generate illusory contours and to use illusory objects in a veridical fashion. The stimulus is, again, adapted from Kanizsa (1979), and shows a perceptually vivid, illusory white square in a field of black discs. The illusory square appears to be closer to the viewer than the background, and, in addition, the four discs that lie inside its borders also appear closer than the background (some viewers perceive the four internal discs to be even closer than the illusory square). This is an example of what we call ”occlusion capture,” an effect related to the capture phenomena involving motion, stereopsis, and other submodalities (Ramachandran and Cavanaugh 1985; Ramachandran 1986). In this case, the illusory square has “captured” the discs within its borders and they are thus pulled into the foreground. Figure 7A shows the contour binding tags after one (left) and three (right) cycles of activity. Each disc receives a separate tag. After the responses to illusory square are generated, the illusory contours are fed back to the contour binding network and given a common tag. Note that the edges of the discs occluded by the illusory square are now given the same tag as the square, not the same tags as the discs. The change in ”ownership” of the occluded edges of the discs is the critical step in defining the illusory square as an object. For example, Figure 7B shows the output of the direction o f f i p r e network after one and three cycles of activity. The large display shows that every disc is identified as an object with the inside of the disc correctly labeled in each case. The two insets focus on a portion of the display near the bottom left edge of the illusory square. At first, the system identifies the “L”shaped angular edge as belonging to the disc, and thus the direction of figure arrows point “inward.“ After three cycles of activity, this same “L”-shaped edge is identified as belonging to the illusory square, and thus the arrows now point toward the inside of the square, rather than the inside of the disc. This change in the ownership of the edge results from the discrimination of occlusion-the edge has been determined to
Figure 5: Facing p g e . Object discrimination and stratification in depth. Top panel shows a 64 x 64 input stimulus presented to the system. (A) Spatial histogram of the contour binding tags (each box shows units with common tag, different boxes represent different tags, and the order of the boxes is arbitrary). Initial tags shown on left; tags after five iterations shown on right. Note that linking of occluded contours has transformed proto-objects into objects. (B) Magnified view of a local section of the direction of figure network corresponding to portion of the image near horse‘s nose and crossing fence posts. Arrows indicate direction of inside of proto-objects as determined by network. (C) Relative depth of objects in scene as determined by the system. Plot of activity (% of maximum) of units in the foreground network after five iterations. Points with higher activity are ”perceived” as being relatively closer to the viewer.
be an occlusion border. The interconnected processing of the system then results in a change in the direction of figure and of the continuity tags associated with this edge. The illusory square is perceived as an object. Its four contours are bound together, the contours are bound to the internal surface, and the properties of the surface are identified.
Figure 7C displays the firing rate of units in the foreground map (as in 5C), thus showing the relative depths discriminated by the system. The discs are placed in the background; the illusory square and the four discs within its borders are located in the foreground. In this case, the depth cue which forces the internal discs to the foreground is not due to tag junctions, but rather to surround occlusion (see Fig. 3). Once the illusory square is generated, the contours of the discs inside the square are surrounded by that of the square. The fact that the contour is "illusory" is irrelevant; once responses are generated in the networks responsible for linking occluding contours and are then fed back to earlier networks, they are indistinguishable from responses to real contours in the periphery. Thus the system demonstrates occlusion capture corresponding to human perceptions of this stimulus.

5 Discussion
In most visual scenes, the majority of objects are partially occluded. Our seamless perception of the world depends upon an ability to complete or link the spatially separated, non-occluded portions of an object. We have used the idea that the visual system identifies proto-objects (which may or may not be objects) and then attempts to link these proto-objects into larger structures. This linking process is most apparent in the perception of illusory contours, and our model can account for a wide range of these illusions. This model builds upon previous neural, psychological, and machine vision studies. Several models of illusory contour generation (Ullman 1977; Peterhans and von der Heydt 1989; Finkel and Edelman 1989) have used related mechanisms to check for collinearity and to generate the illusory contours. Our model differs at a more fundamental level: we are concerned with objects, not just contours. To define an object, surfaces must also be considered. For example, in a simple line drawing, we perceive an interior surface despite the fact that no surface properties are indicated. Thus, the model must be capable of characterizing a surface, and it does so, in a rudimentary manner, by determining the direction of figure and relative depth. Nakayama and Shimojo (1990) have approached the problem of surface representation from a similar viewpoint. They discuss how contours and surfaces become associated, how T-junctions serve to stratify objects in depth, and how occluded surfaces are amodally completed. Nakayama's analysis concentrates on the external "ecological" constraints on perception. In addition to these Gibsonian constraints, we emphasize the importance of internal constraints imposed by physiological mechanisms and neural architectures. Nakayama has also explored the interactions between occlusion and surface attributes.

Figure 6: Minimization of ambiguous discontinuities. Upper panel shows an ambiguous stimulus (adapted from Kanizsa 1979), two possible perceptual interpretations of which are shown below. The interpretation on the left is dominant for humans, despite the figural symmetry of the segmentation on the right. Stimulus was presented to the system; results shown after three iterations. (A) Spatial histogram showing the contour binding patterns (as in 5A). The network segments the figures in the same manner as human perception. (B) Determination of direction of figure confirms network interpretation (note that at junction points, direction of figure is indeterminate).

A more complete model must consider surface properties such as color, brightness, texture, and surface orientation. The examination of
how surface features might interact with contour boundaries has been pioneered by Grossberg (1987). Finally, in some regards, our model constitutes the first step of a "bottom-up" model of object perception (Kanizsa 1979; Biederman 1987). It is interesting that regardless of one's orientation (bottom-up or top-down) the constraints of the physical problem result in certain similarities of solution, as witnessed by the analogies present with AI-based models (Fisher 1989).

Figure 7: Facing page. Occlusion capture. Upper panel shows stimulus (adapted from Kanizsa 1979) in which we perceive a white illusory square. Note that the four black discs inside the illusory square appear closer than the background. A 64 x 64 discrete version of the stimulus was presented to the network. (A) Spatial histogram (as in 5A) of the initial and final (after three iterations) contour binding tags. Note that the illusory square is bound as an object. (B) Direction of figure determined by the system. Insets show a magnified view of the initial (left) and final (right) direction of figure (region of magnification is indicated). Note that the direction of figure of the "mouth" of the "pac-man" flips once the illusory contour is generated. (C) Activity in the foreground network (% of maximum) demonstrates network stratification of objects in relative depth. The illusory square has "captured" the background texture.

One of the most speculative aspects of the model is the use of tags to identify elements as belonging to the same object. Tags linking units responding to the same contour are used to determine the direction of figure and to change the perceived depth of the entire contour based on occlusion relationships detected at isolated points (the tag junctions). It is possible to derive alternative mechanisms for these processes that do not depend on the use of tags, but they are conceptually inelegant and computationally unwieldy. Our model offers no insight as to the biophysical basis of such a tag. However, the model does suggest that there should be a relatively small number of tags, on the order of 10, since this number corresponds to the number of objects that can be simultaneously discriminated. This constraint is consistent with several possible mechanisms: tags represented by different oscillation frequencies, tags represented by different phases of firing, or tags represented by firing within discrete time windows (e.g., the first 10 msec of each 50 msec interval). The number of distinct tags generated by these various mechanisms may depend on the integration time of the neuron, or possibly on the time constant of a synaptic switch, such as the NMDA receptor.

At the outset, we discussed the importance of both binding and segmentation for visual object discrimination. Our model has largely dealt with the segmentation problem; however, the two problems are not entirely independent. For example, the association of a depth value with the object discriminated is, in essence, an example of the binding of an attribute to an object. Consideration of additional attributes makes the
problem more complex, but it also aids in the discrimination of separate objects (Damasio 1989; Crick and Koch 1990). For example, we have only considered static visual scenes, but one of the major cues to the linking process is common motion of proto-objects. During development, common motion may, in fact, play the largest role in establishing our concept of what is an object (Termine et al. 1987). Object definition also clearly depends on higher cognitive processes such as attention, context, and categorization (Rosch and Lloyd 1978). There is abundant evidence that "top-down" processes can influence the discrimination of figure/ground as well as the perception of illusory figures (Gregory 1972). The examples considered here (e.g., Figs. 5-7) represent extended visual scenes, and perception of these stimuli would require multiple shifts of gaze and/or attention. The representation of such a scene in intermediate vision is thus a more dynamic entity than portrayed here. The processes we have proposed are rapid (all occur in several cycles of iteration), and thus might be ascribed to preattentive perception. However, such preattentive processing sets the stage for directed attention because it defines segmented objects localized to particular spatial locations. Furthermore, the process of binding contours, surfaces, and surface features may be restricted to one or two limited spatial regions at any one time. Thus, feature binding may be a substrate rather than a result of the attentional process.

We have implicitly assumed that object discrimination is a necessary precursor to object recognition. Ullman (1989) has developed a model of recognition that demonstrates that this need not logically be the case. The question of whether you have to know that something is a "thing" before you can recognize what kind of thing it is remains to be determined through psychophysical experiment. It is appealing, however, to view object discrimination as the function of intermediate vision, i.e., those processes carried out by the multiple extrastriate visual areas. In this view, each cortical module develops invariant representations of aspects of the visual scene (motion, color, texture, depth) and the operations of these modules are dynamically linked. The consistent representations developed in intermediate vision then serve as the substrate for higher level cognitive processes.

In conclusion, we have shown that one can build a self-contained system for discriminating objects based on occlusion relationships. The model is successful at stratifying simple visual scenes, at linking the representations of occluded objects, and at generating responses to illusory objects in a manner consistent with human perceptual responses. The model uses neural circuits that are biologically based, and conforms to general neural principles, such as the use of a distributed representation for depth. The system can be tested in psychophysical paradigms and the results compared to human and animal results. In this manner, a computational model that is designed based on physiological data and
tested in comparison to psychophysical data offers a powerful paradigm for bridging the gap between neuroscience and perception.

Note Added in Proof

The recent findings of dynamic changes in receptive field structure in striate cortical neurons by Gilbert and Wiesel (1992) indicate that long-range connections undergo context-dependent changes in efficacy. Such a mechanism may provide the biological basis for the direction of figure and linkage mechanisms proposed here. [Gilbert, C. D., and Wiesel, T. N. 1992. Receptive field dynamics in adult primary visual cortex. Nature 356, 150-152.]

Acknowledgments

This work was supported by grants from The Office of Naval Research (N00014-90-J-1864), The Whitaker Foundation, and The McDonnell-Pew Program in Cognitive Neuroscience.

References

Aloimonos, J., and Shulman, D. 1989. Integration of Visual Modules. Academic Press, New York.
Attneave, F. 1954. Some informational aspects of visual perception. Psych. Rev. 61, 183-193.
Barlow, H. B. 1981. Critical limiting factors in the design of the eye and visual cortex. Proc. R. Soc. (London) B212, 1-34.
Biederman, I. 1987. Recognition by components: A theory of human image understanding. Psych. Rev. 94, 115-147.
Crick, F., and Koch, C. 1990. Towards a neurobiological theory of consciousness. Semin. Neurosci. 2, 263-275.
Damasio, A. R. 1989. The brain binds entities and events by multiregional activation from convergence zones. Neural Comp. 1, 123-132.
Dobbins, A. S., Zucker, S. W., and Cynader, M. S. 1987. Endstopping in the visual cortex as a neural substrate for calculating curvature. Nature (London) 329, 438-441.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybernet. 60, 121-130.
Finkel, L., and Edelman, G. 1989. Integration of distributed cortical systems by reentry: A computer simulation of interactive functionally segregated visual areas. J. Neurosci. 9, 3188-3208.
Fisher, R. B. 1989. From Objects to Surfaces. John Wiley & Sons, New York.
Gilbert, C. D., and Wiesel, T. N. 1989. Columnar specificity of intrinsic connections in cat visual cortex. J. Neurosci. 9, 2432-2442.
Gray, C. M., and Singer, W. 1989. Neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698-1702.
Gregory, R. L. 1972. Cognitive contours. Nature (London) 238, 51-52.
Grossberg, S. 1987. Cortical dynamics of three-dimensional form, color, and brightness perception. I: Monocular theory. Percept. Psychophys. 41, 87-116.
Grossberg, S., and Mingolla, E. 1985. Neural dynamics of form perception: Boundary completion, illusory figures, and neon color spreading. Psychol. Rev. 92, 173-211.
Guzman, A. 1968. Decomposition of a visual scene into three-dimensional bodies. Fall Joint Comput. Conf. 1968, 291-304.
Kanizsa, G. 1979. Organization in Vision. Praeger, New York.
Koffka, K. 1935. Principles of Gestalt Psychology. Harcourt, Brace, New York.
Konig, P., and Schillen, T. 1991. Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization. Neural Comp. 3, 155-166.
Lehky, S., and Sejnowski, T. 1990. Neural model of stereoacuity and depth interpolation based on distributed representation of stereo disparity. J. Neurosci. 10, 2281-2299.
Livingstone, M. S., and Hubel, D. 1988. Segregation of form, color, movement, and depth: Anatomy, physiology, and perception. Science 240, 740-749.
Marr, D. 1982. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, San Francisco.
Minsky, M., and Papert, S. 1969. Perceptrons. The MIT Press, Cambridge, MA.
Mitchison, G., and Crick, F. 1982. Long axons within the striate cortex: Their distribution, orientation, and patterns of connections. Proc. Natl. Acad. Sci. U.S.A. 79, 3661-3665.
Nakayama, K., and Shimojo, S. 1990. Toward a neural understanding of visual surface representation. Cold Spring Harbor Symp. Quant. Biol. LV, 911-924.
Peterhans, E., and von der Heydt, R. 1989. Mechanisms of contour perception in monkey visual cortex. II. Contours bridging gaps. J. Neurosci. 9, 1749-1763.
Poggio, G. F., Gonzalez, F., and Krause, F. 1988. Stereoscopic mechanisms in monkey visual cortex: Binocular correlation and disparity selectivity. J. Neurosci. 8, 4531-4550.
Poggio, T., Gamble, E. B., and Little, J. J. 1988. Parallel integration of vision modules. Science 242, 436-440.
Ramachandran, V. S. 1987. Visual perception of surfaces: A biological theory. In The Perception of Illusory Contours, S. Petry and G. E. Meyer, eds., pp. 93-108. Springer-Verlag, New York.
Ramachandran, V. S. 1986. Capture of stereopsis and apparent motion by illusory contours. Percept. Psychophys. 39, 361-373.
Ramachandran, V. S., and Cavanaugh, P. 1985. Subjective contours capture stereopsis. Nature (London) 317, 527-530.
Resnikoff, H. L. 1989. The Illusion of Reality. Springer-Verlag, New York.
Rockland, K. S., and Lund, J. S. 1982. Widespread periodic intrinsic connections in the tree shrew visual cortex. Science 215, 1532-1534.
Rosch, E., and Lloyd, B. B. 1978. Cognition and Categorization. Lawrence Erlbaum, Hillsdale, NJ.
Rosenfeld, A. 1988. Computer vision. Adv. Comput. 27, 265-308.
Sajda, P., and Finkel, L. 1990. NEXUS: A neural simulation environment. University of Pennsylvania Tech. Rep.
Object Discrimination
921
Sajda, P., and Finkel, L. 1992. NEXUS: A simulation environment for large-scale neural systems. Sitnirlntioli, in press. Sejnowski, T., and Hinton, G. 1987. Separating figure from ground with a Boltzmann machine. In Visiotz, Brniti mid C o o p m t i v e Cotriptnfioti, M. Arbib and A. Hanson, eds., pp. 703-724. The MIT Press, Cambridge, MA. Shapley, R., and Gordon, J . 1987. The existence of interpolated illusory contours depends on contrast and spatial separation. In The Percc,ptiotl oflll~rsc~r-!/ Cotitours, S. Petry and C. E. Meyer, eds., pp. 109-115. Springer-Verlag, New York. Termine, N., Hrynick, T., Kestenbaum, T., Gleitman, H., and Spelke, E. S. 1987. Perceptual completion of surfaces in infancy. 1. Exp. Psychol. Hirtrinri Prvccpt. 13, 524-532. Ullman, S. 1989. Aligning pictorial descriptions: An approach to object recognition. Cogtiifion 32, 193-254. Ullman, S. 1977. Filling-in the gaps: The shape of subjective contours and a model for their generation. Biol. Cyberrtef. 25, 1-6. von der Heydt, R., and Peterhans, E. 1989. Mechanisms of contour perception in monkey visual cortex. I. Lines of pattern discontinuity. I. Nclrrosci. 9, 1731-1 748. ~
~~~
~~
~~
~
Received 22 November 1991; accepted 6 April 1992
Communicated by John Bridle
An Adaptive Lattice Architecture for Dynamic Multilayer Perceptrons Andrew D. Back Ah Chung Tsoi Department of Electrical Engineering, University of Queensland, St. Lucia 4072, Australia
Time-series modeling is a topic of growing interest in neural network research. Various methods have been proposed for extending the nonlinear approximation capabilities to time-series modeling problems. A multilayer perceptron (MLP) with a global-feedforward local-recurrent structure was recently introduced as a new approach to modeling dynamic systems. The network uses adaptive infinite impulse response (IIR) synapses (it is thus termed an IIR MLP), and was shown to have good modeling performance. One problem with linear IIR filters is that the rate of convergence depends on the covariance matrix of the input data. This extends to the IIR MLP: it learns well for white input signals, but converges more slowly with nonwhite inputs. To solve this problem, the adaptive lattice multilayer perceptron (AL MLP) is introduced. The network structure performs Gram-Schmidt orthogonalization on the input data to each synapse. The method is based on the same principles as the Gram-Schmidt neural net proposed by Orfanidis (1990b), but instead of using a network layer for the orthogonalization, each synapse comprises an adaptive lattice filter. A learning algorithm is derived for the network that minimizes a mean square error criterion. Simulations are presented to show that the network architecture significantly improves the learning rate when correlated input signals are present. 1 Introduction
Neural network models have been successfully applied to many problems requiring static nonlinear mappings and classifications (that is, no time dependence exists between data). Yet to solve many practical problems, it is necessary to model dynamic systems, where the mapping to be learned depends on the present and past states of the system (these may be measurable outputs or inaccessible states). A number of architectures have been used in neural network models to approximate time-dependent systems. These include windows of time-delayed inputs Neural Computation 4, 922-931 (1992)
@ 1992 Massachusetts Institute of Technology
(Lapedes and Farber 1987), state inputs (Guez and Selinsky 1988), and recurrent connections (Jordan 1988; Robinson 1989; Williams and Zipser 1989). A network having a local-recurrent global-feedforward structure was recently introduced (Back and Tsoi 1991a), in which the synapses are adaptive infinite impulse response (IIR) filters. The IIR synapse multilayer perceptron (IIR MLP) is a generalization of multilayer perceptrons involving only feedforward time delays [such as the time delay neural network (TDNN¹)]. This approach differs from other recurrent networks (Robinson 1989; Williams and Zipser 1989; Jordan 1988; Elman 1990) where global feedback is used. An algorithm for an FIR MLP was first presented by Wan (1990). The synaptic equation for an IIR MLP is given by

y(t) = Σ_{j=0}^{m} b_j q^{-j} x(t) + Σ_{j=1}^{n} a_j q^{-j} y(t)

where y(t) is the synaptic output, x(t) the synaptic input, b_j, j = 0, 1, ..., m are the feedforward coefficients, a_j, j = 1, 2, ..., n are the feedback coefficients, and q^{-j} z(t) = z(t - j). It has been shown by simulation (Back and Tsoi 1991a) that this class of networks is a better model than a network with only feedforward time-delay synapses. Since the network overcomes the finite window limitation of the TDNN, it can model long term dynamic behavior. Correlated inputs are known to result in slow learning times for linear adaptive filters (Haykin 1986). This difficulty in learning was also recognized to occur in multilayer perceptrons by Orfanidis (1990b). The IIR MLP is subject to the same problem: when the inputs are not white noise, learning at any particular synapse is slowed by a factor proportional to the eigenvalue spread of the input data to that synapse. In linear filtering theory, the adaptive lattice filter (Friedlander 1982) has been designed to overcome the problem by performing a Gram-Schmidt orthogonalization on the input data. Adaptive lattice filters have additional advantages in terms of reduced sensitivity to finite-precision effects (as would be encountered in VLSI implementations). In this paper, a multilayer perceptron with adaptive lattice synapses is introduced to overcome the above mentioned limitations of the IIR MLP. The outline of the paper is as follows: in Section 2, a lattice network architecture is introduced. In Section 3, a learning rule is derived by extending a previous algorithm for a linear adaptive lattice filter. Simulations of the IIR MLP and AL MLP modeling a nonlinear dynamic system with correlated inputs are presented in Section 4. Conclusions are given in Section 5.

¹The synapses in the TDNN can be considered as finite impulse response (FIR) filters, although the overall structure of the TDNN may be quite different (Waibel et al. 1989). A more recently introduced algorithm, the autoregressive backpropagation algorithm (Leighton and Conrath 1991), is a special case of the IIR MLP. It is readily seen that the AR neuron is equal to two neurons in our terminology: the first neuron having a single FIR synapse, followed by a second, linear-output neuron with an all-pole IIR synapse.
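For concreteness, here is a minimal Python sketch of such a synapse; it is our own illustration, not the authors' code, and the direct-form recursion and function name are assumptions:

```python
import numpy as np

def iir_synapse(x, b, a):
    """Direct-form IIR synapse:
    y(t) = sum_{j=0}^{m} b[j] x(t-j) + sum_{j=1}^{n} a[j] y(t-j)."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        # feedforward (FIR) part with coefficients b_0..b_m
        acc = sum(b[j] * x[t - j] for j in range(len(b)) if t - j >= 0)
        # feedback (recursive) part with coefficients a_1..a_n
        acc += sum(a[j - 1] * y[t - j] for j in range(1, len(a) + 1) if t - j >= 0)
        y[t] = acc
    return y
```

Setting all feedback coefficients a_j to zero recovers a TDNN-style FIR synapse with a finite window.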
Figure 1: A two-multiplier lattice synapse.
2 An Adaptive Lattice MLP Architecture
In this section, a network architecture based on the multilayer perceptron, but with adaptive lattice synapses, is introduced. Each synapse is a two-multiplier IIR lattice (Fig. 1), although other structures involving various numbers of multipliers could be used (Gray and Markel 1973; Tummala 1989). A detailed review of adaptive lattice filters can be found in Friedlander (1982). The AL MLP is defined as follows (its structure is shown in Fig. 2). The forward equations for layer l (l = 0, 1, 2, ..., L, i.e., there are L + 1 layers in the MLP) are given by equations 2.1-2.4.
Figure 2: An adaptive lattice multilayer perceptron (AL MLP).
Here x_k^l(t) denotes the neuron output, s_k^l(t) the neuron state, y_i(t) the ith synaptic output, g_i(t) the ith synaptic gain, v_j the jth feedforward coefficient in the ith synapse, b_j(t) the jth backward residual in the ith synapse, and m the order of each lattice synapse. The flow of information through a synapse is the same as in the linear filter case (Parikh et al. 1980):
b_j(t) = b_{j-1}(t - 1) + k_j(t) f_{j-1}(t),    j = 1, 2, ..., m    (2.5)

f_{j-1}(t) = f_j(t) - k_j(t) b_{j-1}(t - 1),    j = m, m - 1, ..., 1    (2.6)
with initial and boundary conditions

b_0(t) = f_0(t)    (2.7)

f_m(t) = z(t)    (2.8)
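A sketch of one time step of such a lattice synapse, following the reconstruction of equations 2.5-2.8 above (the function name and array layout are ours, not the authors'):

```python
import numpy as np

def lattice_synapse_step(z_t, k, b_prev):
    """One time step of an order-m two-multiplier IIR lattice.
    z_t    : input from the previous layer, entering as f_m(t) (eq. 2.8)
    k      : reflection coefficients k_1..k_m
    b_prev : backward residuals b_0(t-1)..b_m(t-1)
    Returns (f, b), the forward and backward residuals at time t."""
    m = len(k)
    f = np.zeros(m + 1)
    b = np.zeros(m + 1)
    f[m] = z_t                                    # boundary condition (2.8)
    for j in range(m, 0, -1):                     # eq. 2.6, j = m, m-1, ..., 1
        f[j - 1] = f[j] - k[j - 1] * b_prev[j - 1]
    for j in range(1, m + 1):                     # eq. 2.5, j = 1, 2, ..., m
        b[j] = b_prev[j - 1] + k[j - 1] * f[j - 1]
    b[0] = f[0]                                   # boundary condition (2.7)
    return f, b
```

The backward residuals b_0..b_m produced this way are (asymptotically) mutually orthogonal, which is the Gram-Schmidt property the learning algorithm exploits.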
Figure 3: A reduced complexity adaptive lattice MLP modeling a dynamic nonlinear plant.

With reference to Figure 1, z(t) in 2.8 is the output of the neuron in the previous layer l - 1. It is often possible to use a network architecture with reduced complexity to model the situation when the unknown system involves only a linear dynamic part followed by a static nonlinear part. In this case, only the first layer (l = 1) has lattice synapses (equations 2.1-2.8 apply). For layers 2 through L, the network has no dynamic synaptic structure and is defined in the usual way:
x_k^l(t) = f( Σ_i w_{ki}^l x_i^{l-1}(t) + w_{k0}^l )    (2.10)

where k = 1, 2, ..., N_l, with N_l the number of neurons in layer l, and w_{k0}^l acts as the bias. The reduced complexity network is shown in Figure 3. In the next section, a gradient descent learning algorithm is derived for the AL MLP.
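A minimal sketch of such a static layer (ours; the sigmoid choice follows the "Sigmoidal units" of Figure 3, and the names are illustrative):

```python
import numpy as np

def static_layer(x_prev, W, w0):
    """Static layer of the reduced complexity network (cf. equation 2.10):
    an ordinary sigmoidal unit per neuron, with bias and no synaptic dynamics."""
    s = W @ x_prev + w0
    return 1.0 / (1.0 + np.exp(-s))
```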
3 A Learning Algorithm for the AL MLP
In this section, a learning algorithm is developed based on the backpropagation algorithm (Rumelhart et al. 1986). Each coefficient in the lattice synapses is adjusted iteratively, minimizing the mean square error criterion, defined as

J(t) = (1/2) Σ_{k=1}^{N_L} [y_k(t) - x_k^L(t)]²    (3.1)

where y_k(t) is the desired output at time t, and N_L is the number of neurons in the output layer (layer L). The update equations for each synapse are similar to the linear case (Parikh et al. 1980), except that the error term is replaced by a backpropagated delta [δ_k(t)], and the synaptic gain terms [g(t)] are included. The feedforward coefficients for each synapse are updated according to
v(t + 1) = v(t) + Δv(t)    (3.2)

Δv(t) = -η ∂J(t)/∂v(t)    (3.3)
The delta term δ_k^l(t) for the kth neuron in layer l is computed in a similar manner to the original backpropagation algorithm, taking into account the lattice structure of the synapses; for an L + 1 layer network it is given by equations 3.6-3.9, where v_{khj}(t) is the jth feedforward coefficient of the synapse between the kth and hth neurons in layers l and l + 1, respectively.
The reflection coefficients are updated according to

k_j(t + 1) = k_j(t) + Δk_j(t)    (3.10)

with Δk_j(t) given by equations 3.11-3.13. The synaptic gain term introduced in 2.2 is updated according to equation 3.14.
For the reduced complexity network, the delta terms [δ_k^l(t)] are calculated the same as in the normal backpropagation algorithm (Rumelhart et al. 1986) by replacing 3.7 with

δ_k^l(t) = f'[s_k^l(t)] Σ_{h=1}^{N_{l+1}} δ_h^{l+1}(t) w_{hk}^{l+1}    (3.16)
The learning algorithm for the AL MLP is an extension of the linear lattice filter algorithm derived by Parikh et al. (1980), and offers a method of applying the known advantages of Gram-Schmidt orthogonalization in adaptive lattice filters to nonlinear systems. Consult Parikh et al. (1980) or Friedlander (1982) for further details of the lattice implementation.

4 Simulation Examples

The performance of the AL MLP was compared against an IIR MLP by modeling a nonlinear plant described by
y_p(t) = sin[ (0.0154 + 0.0462 q^{-1} + 0.0462 q^{-2} + 0.0154 q^{-3}) / (1 - 1.99 q^{-1} + 1.572 q^{-2} - 0.4583 q^{-3}) x(t) ]    (4.1)
The networks were trained with correlated input signals generated by a white noise source passed through the filter

H(q) = 1 / (1 + a_1 q^{-1} + a_2 q^{-2})    (4.2)

where a_1 = -1.6 and a_2 = 0.95. In this case, the eigenvalues of the covariance matrix were (3.43, 0.35), giving an eigenvalue spread of 3.43/0.35 = 9.8. Similarly, for a_1 = -1.8, the corresponding eigenvalues were (8.13, 0.32), giving an eigenvalue spread of 25. The nonlinear system was modeled as shown in Figure 3. Note that a reduced complexity network is used for each network, as introduced in Back and Tsoi (1991b). The order of the IIR synapse was selected as (m, n) = (7, 7), and the AL synapse order was m = 7. The learning rates chosen were η_dynamic = 0.0001 and η_static = 0.05, where the dynamic subscript refers to the time-delay synapses, and the static subscript refers to synapses without time delays. The average mean squared error was plotted during learning, using 50 runs with each point being a moving average of the previous 20 points (Fig. 4). Though not shown here, it is observed that as the eigenvalue spread of the input data increases, the time to achieve convergence for each network increases, but the AL MLP has significantly better performance than the IIR MLP.

5 Conclusions
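A rough sketch of this input generation (ours; the paper does not state the noise scaling or the dimension of the covariance matrix used, so the empirical 2 x 2 spread below comes out near, but not necessarily exactly at, the quoted 9.8):

```python
import numpy as np

rng = np.random.default_rng(0)

def correlated_input(n, a1, a2):
    """White noise coloured by 1 / (1 + a1 q^-1 + a2 q^-2), i.e.
    x(t) = -a1 x(t-1) - a2 x(t-2) + w(t)."""
    w = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(n):
        x[t] = w[t]
        if t >= 1:
            x[t] -= a1 * x[t - 1]
        if t >= 2:
            x[t] -= a2 * x[t - 2]
    return x

x = correlated_input(100_000, a1=-1.6, a2=0.95)
X = np.stack([x[1:], x[:-1]])              # pairs of successive samples
eig = np.linalg.eigvalsh(np.cov(X))
print("eigenvalue spread:", eig.max() / eig.min())
```

The spread is scale-invariant, which is why it, rather than the raw eigenvalues, governs the convergence rate of gradient adaptation.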
The IIR MLP was presented recently as a means of modeling nonlinear dynamic systems with good results. A problem exists, however, if the input data are correlated (this can occur in many signal processing applications: for example, in speech or sonar signals). The problem of correlated inputs for the standard MLP was also recognized by Orfanidis, who proposed the use of a Gram-Schmidt orthogonalization layer in the network structure (Orfanidis 1990b). In this paper, a solution to the problem of correlated input data in the IIR MLP was proposed by introducing adaptive lattice filters for each synapse. The adaptive lattice network structure was described and a learning algorithm presented. Computer simulations were used to verify that the adaptive lattice multilayer perceptron significantly improves the time to achieve convergence when the input data are correlated.
Acknowledgments The authors wish to thank the reviewer for helpful comments and suggestions. The first author is supported by a Research Fellowship with the
Figure 4: Mean square error during learning for IIR MLP and AL MLP (averaged over 50 runs): (a) eigenspread of input = 9.8, (b) eigenspread of input = 25.
Electronics Research Laboratory, DSTO, Australia. The second author acknowledges partial support from the Australian Research Council.
References

Back, A. D., and Tsoi, A. C. 1991a. FIR and IIR synapses, a new neural network architecture for time series modelling. Neural Comp. 3(3), 352-362.
Back, A. D., and Tsoi, A. C. 1991b. Analysis of hidden layer weights in a dynamic locally recurrent network. In Artificial Neural Networks, T. Kohonen, K. Makisara, O. Simula, and J. Kangas, eds., pp. 961-966. Elsevier Science Publishers B.V., North-Holland.
Elman, J. L. 1990. Finding structure in time. Cog. Sci. 14, 179-211.
Friedlander, B. 1982. Lattice filters for adaptive processing. Proc. IEEE 70(8), 829-867.
Gray, A. H., and Markel, J. D. 1973. Digital lattice and ladder filter synthesis. IEEE Trans. Audio Electroacoust. 21, 491-500.
Guez, A., and Selinsky, J. 1988. A neuromorphic controller with a human teacher. Proc. IEEE Int. Joint Conf. Neural Networks II, 595-602.
Haykin, S. 1986. Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ.
Jordan, M. I. 1988. Supervised learning and systems with excess degrees of freedom. COINS Tech. Rep. 88-27, University of Massachusetts, Amherst.
Lapedes, A., and Farber, R. 1987. Nonlinear signal processing using neural networks: Prediction and system modelling. Tech. Rep. LA-UR87-2662, Los Alamos National Laboratory.
Leighton, R. R., and Conrath, B. C. 1991. The autoregressive backpropagation algorithm. Proc. IEEE Int. Joint Conf. Neural Networks II, 369.
Orfanidis, S. J. 1990a. Optimum Signal Processing, 2nd ed. McGraw-Hill, New York.
Orfanidis, S. J. 1990b. Gram-Schmidt neural nets. Neural Comp. 2, 116-126.
Parikh, D., Ahmed, N., and Stearns, S. D. 1980. An adaptive lattice algorithm for recursive filters. IEEE Trans. Acoust., Speech, Signal Process. 28, 110-111.
Robinson, A. J. 1989. Dynamic error propagation networks. Ph.D. dissertation, Cambridge University Engineering Department.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, Ch. 8. The MIT Press, Cambridge, MA.
Tummala, M. 1989. One-multiplier adaptive lattice algorithm for recursive filters. Circuits, Systems, and Signal Processing 8(4), 455-466.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust., Speech, Signal Process. ASSP-37, 328-339.
Wan, E. A. 1990. Temporal backpropagation for FIR neural networks. Proc. IEEE Int. Joint Conf. Neural Networks I, 575-580.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 268-278.
Received 31 May 1991; accepted 9 April 1992.
Communicated by Christof Koch
A Model Circuit for Cortical Temporal Low-Pass Filtering R. Maex G. A. Orban Laboratorium voor Neuro- en Psychofysiologie, Katholieke Universiteit Leuven, Campus Gasthuisberg, B-3000 Leuven, Belgium
We propose that the low-pass characteristics in the temporal and velocity domain of area 17 cells are generated by the abundant excitatory connections between cortical neurons. We have incorporated this anatomical feature in a model circuit in which simple cells' firing is initiated by geniculocortical excitatory synaptic input, with a short time course, and firing is maintained by feedback corticocortical excitatory synapses, which have a longer time course. The low-pass performance of the model is demonstrated by computing the model simple cells' velocity response curves (VRC) elicited by moving bars, and comparing these to those of its LGN (lateral geniculate nucleus) inputs. For the same parameter set, the VRCs of sustained and transient LGN cells are transformed into VRCs typical of central area 17 and central area 18 cells, respectively. 1 Introduction
It has been argued that only cortical circuitry can account for the temporal properties of cortical cells, and particularly for the abundance of velocity low-pass cells in area 17 of cat and monkey (Orban et al. 1981a; Movshon et al. 1978; Lee et al. 1981; Derrington and Fuchs 1979; Orban et al. 1986). Neither the input from the lateral geniculate nucleus (LGN) nor the membrane time constant of cortical neurons can generate the low-pass time constant of 53 to 80 msec (Worgotter and Holt 1991) measured for area 17 simple cells near the area centralis. Functionally, these velocity low-pass simple cells are suited as spatial analyzers during fixation (Duysens et al. 1985b; Orban 1991). Computationally, low-pass filtering is important for creating and preserving direction selectivity. About one-fourth of velocity low-pass cells are direction selective (Orban et al. 1981b; Orban et al. 1986). This can be achieved by combining inputs that are spatially and temporally separated (Maex and Orban 1991). Due to temporal aliasing, models of direction selectivity begin to prefer the opposite direction of moving gratings at twice their optimal temporal frequency. Since most area 17 cells have an optimal temporal frequency of 1 to 4 Hz, while their geniculate input responds Neural Computation
4, 932-945 (1992)
@ 1992 Massachusetts Institute of Technology
optimally near 10 Hz (Lee et al. 1981), a low-pass mechanism is needed to eliminate the responses at higher speeds. In this paper, we present a biologically realistic model cortical circuit producing temporal low-pass filtering in simple cells, and the model's responses to bars moving at different speeds. The contribution of different sources of low-pass filtering in the model is discussed.

2 Model and Simulations

2.1 Model Cortical Circuit. The full model (Fig. 1A) comprises both excitatory (putative pyramidal or spiny stellate cells) and inhibitory
Figure 1: Full (A) and reduced (B) model for studying cortical low-pass filtering in simple cells. Circles represent neuron somata, > = excitatory synapse, o = inhibitory synapse. (A) The RFs of ON- and OFF-center LGN cells (E and F) totally overlap. Only a single discharge region of the cortical simple cells is modeled: an ON subregion for neurons A and B, an OFF subregion for neurons C and D. Inhibitory neurons (filled circles, B and D) inhibit neurons with RFs of opposite polarity (e.g., cell B inhibits cell C and cell D). Excitatory neurons (open circles, A and C) excite neurons the RFs of which have the same polarity (e.g., cell A excites cell A and cell B). (B) Reduced version of the model in A. Only low-pass mechanisms I and II (see text) are preserved. Neuron A has been decomposed into a pool of 3 cells (a, b, and c). The LGN input into neurons b and c (dashed lines) could be deleted to study the behavior of second-order neurons.
(putative basket or clutch cells) neurons. The excitatory cells excite excitatory as well as inhibitory cortical neurons receiving LGN input of the same polarity, while the inhibitory cells inhibit cortical cells whose major LGN input has the opposite polarity. The receptive fields (RF) of these cortical neurons totally overlap and have but a single discharge region, which has the same polarity as their LGN input. One consequence is that simple cells with a discharge region of opposite polarity mutually inhibit each other in a push-pull way (Palmer and Davis 1981). Modeling a single subregion is acceptable since the low-pass characteristics of simple cells do not critically depend on spatiotemporal interactions within one subregion or between neighboring subregions (Duysens et al. 1985a), and can be predicted from their responses to stationary stimuli (Duysens et al. 1985b; Baker 1988). The model circuit contains three candidate sources of low-pass filtering (labeled in Fig. 1A): feedback excitation (I), temporal integration by synapses (II), and mutual inhibition (III).
2.2 Single-Neuron Models. A detailed, formal description of cortical neurons and their geniculate input is presented in the appendix. The LGN responses were computed by the following sequence: convolution of the stimulus with the gaussian center and surround weighting functions, separate temporal band-pass filtering for center and surround, subtraction of surround from center responses, addition of spontaneous activity, half-wave rectification, and finally spike generation. Cortical neurons are single passive compartments as in Worgotter and Koch (1991). Their spikes are all-or-nothing events that occur with a high probability above a fixed threshold. In the excitatory neurons, a spike is followed by a fast-decaying and a medium-duration afterhyperpolarization (AHP) (Schwindt et al. 1988). The fast-decaying AHP ensures regular firing, while accumulation of medium-duration AHPs causes some adaptation, lowering the firing rate (Connors et al. 1982). Inhibitory neurons have only a short AHP, and hence are fast spiking. Both the synaptic conductances and the AHPs are modeled as conductance changes, the time course of which is an α-function (Jack et al. 1975).
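As an illustration of this single-neuron model, a minimal sketch (ours, not the authors' Pascal code; parameter values follow the appendix, while the Euler integration and unit conventions are simplifying assumptions):

```python
import numpy as np

def alpha_fn(t, tp):
    """Normalized alpha function with time to peak tp (unit integral)."""
    return np.where(t > 0, (t / tp**2) * np.exp(-t / tp), 0.0)

def passive_compartment(g_syn, E_syn, tau=16.0, g_leak=0.125, dt=0.5):
    """Single passive compartment (voltage in mV relative to rest):
    C dV/dt = -g_leak V + g_syn(t) (E_syn - V), with C = tau * g_leak."""
    C = tau * g_leak
    V = np.zeros(len(g_syn))
    for i in range(1, len(g_syn)):
        dV = (-g_leak * V[i - 1] + g_syn[i - 1] * (E_syn - V[i - 1])) / C
        V[i] = V[i - 1] + dt * dV
    return V

# EPSP from a single spike at t = 0 through a fast geniculocortical synapse
t = np.arange(0.0, 100.0, 0.5)           # msec
g = 0.8 * alpha_fn(t, tp=1.0)            # weight and time to peak per appendix
V = passive_compartment(g, E_syn=100.0)  # excitatory reversal at +100 mV
```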
2.3 Parameter Setting. LGN cell centers respond to stimuli of opposite polarity with a latency difference of 32-40 msec (Orban et al. 1985). The RFs of transient cells are twice the size of those of sustained cells (Orban et al. 1985). This, together with a different value for the parameter governing the degree of transientness of the response to stationary stimuli, creates two types of velocity response curves (VRC), which match the data of Gulyas et al. (1990). Since we were interested mainly in transformations in the temporal domain, and also because our stimulus was a narrow bar, we did not model the spatial nonlinearities of transient (or Y) cells.
With the parameter values chosen for the AHPs of the excitatory cortical neurons, a steady-state frequency/intensity slope of 24 Hz/nA was measured on current injection, which falls within the range reported by Connors et al. (1988). The inhibitory synapses have a slow component (time to peak 80 msec, Douglas and Martin 1991) and a fast component (time to peak 2 msec). Since simple cells commonly show a clear hyperpolarization on stimulation with nonoptimal contrast polarity, both components were modeled as conductance changes of the potassium channels, induced by GABA_B receptors. The sensitivity of the model's response to the time course and strength of both geniculocortical and corticocortical excitatory synapses was examined in a systematic way. In the final simulations (Fig. 2C,G), the geniculocortical synapses had a short time course (time to peak 1 msec) (Hagihara et al. 1988), the corticocortical excitatory synapses had a slow time course (time to peak 16 msec), and the ratio of the total conductances induced by these synapses was about 1 over 8 (see Appendix). The width of the discharge region of the sustained-input simple cells, derived from the histograms, is between 0.3 and 0.4 degrees.

2.4 Simulation Methodology. Peristimulus-time histograms were computed for 8 speeds of a narrow light bar by accumulating spikes, generated during at least 80 sweeps, in bins of 8 msec. The VRCs plot the peak responses from the unsmoothed histograms (Orban et al. 1981a). The membrane potential was recorded as well, to evaluate subthreshold and hyperpolarizing responses. Sometimes, e.g., for computing the parameter sensitivity in the feedback excitatory circuit, only a part of the full model was simulated (Fig. 1B). In these simulations, as well as in control simulations of the full model, the excitatory cells were decomposed into a pool of 3 (Fig. 1B) to 7 neurons, in order to prevent a systematic coincidence of the AHPs with the feedback excitation. In that case, the plotted VRCs are the means of the VRCs of the composing neurons. The simulation environment, written by the authors in Pascal, was run on a MicroVAX. First-order nonlinear differential equations were solved with a central difference method.

2.5 Results. To evaluate the low-pass filtering performed by the model, we compared the VRCs of model sustained (Fig. 2A) and transient (Fig. 2E) LGN cells with model simple cells driven by sustained (Fig. 2C) or transient (Fig. 2G) LGN input, when stimulated with moving narrow light bars. First (Fig. 2B,F) we examined the effect of the time course of the geniculocortical synapse. The α-functions describing these time courses have been normalized, i.e., their integrals are constant, as in the original formulation by Jack et al. (1975). For different time-to-peak values, the amplitude of the excitatory postsynaptic potential (EPSP) of an isolated,
nonspiking neuron was measured, hence its membrane potential had no components due to feedback excitation, inhibition, or AHPs. The only effect (Fig. 2B,F) is an attenuation of the EPSPs at high speeds when time-to-peak values increase. Note that at most speeds the peak EPSP of the sustained-input cell does not reach the threshold (15 mV), even if that same neuron generates strong responses in the full model. There are two reasons for this. First, the peak EPSPs plotted are the means over 80 trials in bins of 8 msec, which is, however, half the membrane time constant. Second, the EPSPs received no contribution from the feedback excitation, which is initiated as soon as the threshold is crossed and will maintain firing as long as geniculocortical excitation keeps the membrane potential close enough to the threshold. In all VRCs presented below, the time to peak was set to 1 msec, assuming that receptors of the non-NMDA (N-methyl-D-aspartate) type are involved in geniculocortical transmission (Hagihara et al. 1988). The responses of the full model (Fig. 2C,G) show a shift of the VRCs to lower speeds, creating velocity low-pass behavior only for sustained-input cells (Fig. 2C), and not for transient-input cells (Fig. 2G). Notice that the same cortical circuit that transforms the VRCs of sustained inputs into low-pass curves like those recorded in area 17 (Fig. 2D) also transforms VRCs of transient inputs into curves tuned to slow speeds, typical of area 18 (Fig. 2H). The slight attenuation of the response at the lowest speed in Figure 2C is due to the adaptation of the excitatory neuron. The responses of the OFF-simple cells are strongly reduced by the threshold and by mutual inhibition (cell C and cell D in Fig. 2C,G). The ON-simple cell starts inhibiting the OFF-cell as soon as the light bar enters its RF and so prevents the OFF-cell from firing when the bar leaves the RF, except at higher speeds for the transient-input cells. The sensitivity of the shape of these VRCs to the parameters governing the excitatory feedback loop was mainly examined on the reduced model of Figure 1B. [We feel that a decrease of up to 4 times in simulation time justifies this reduction. Indeed, the only change to cell A

Figure 2: Facing page. (A, C, E, and G) Simulated, normalized velocity response curves (VRCs, see Simulation Methodology). VRCs are computed for a model sustained LGN cell (A), a model transient LGN cell (E), and model simple cells receiving sustained (C) or transient (G) LGN input. The insets attach each VRC to a neuron of Figure 1A. The vertical bars in A span the standard deviations. (D and H) Representative examples of VRCs to moving light bars for velocity low-pass (D) and velocity tuned (H) cells, which are the predominant cell types in central area 17 and central area 18, respectively, of the cat (from Orban et al. 1981a). (B and F) Peak EPSP values (see text) of isolated, nonspiking cortical neurons receiving sustained (B) and transient (F) input for different time-to-peak values of the geniculocortical synapse (see insets). (Note that the velocity axes in the simulation graphics are differently scaled for sustained and transient cases.)
in Fig. 1A is a loss of its inhibitory input from cell D. Now, cell D did not usually respond at all in the full model (Fig. 2C).] The results for sustained input are shown in Figure 3. Increasing the absolute strength of the corticocortical feedback connections, while keeping the geniculocortical weight constant (Fig. 3A; the time to peak of the corticocortical synapses here is 16 msec), amplifies the responses at low speeds until saturation occurs. Increasing time-to-peak values of the corticocortical synapse (Fig. 3B) shifts the velocity at which responses are maximally amplified toward lower speeds. For values close to 16 msec, an optimal velocity low-pass behavior is produced (Fig. 3B, compare with Fig. 2D). Changing the absolute strength of the geniculocortical input (Fig. 3C) influences the VRCs in two ways. At low input strengths, the threshold is reached in only a fraction of the sweeps, decreasing the responses at all speeds. At high input strengths, responses at all speeds increase, and discharge regions become wider; hence the upper cutoff velocity (the speed, on the declining part of the VRC, at which the response is half the maximum response) increases until the VRCs become broad-band. Finally (Fig. 3D), we looked at the VRCs of second-order neurons (neurons b and c in Fig. 1B). Since their LGN input must pass an additional neuron and synapse, their upper cutoff velocity decreases, but the responses at low speeds can be amplified as well as in first-order neurons.

3 Discussion
The problem treated in this article is how cortical low-pass filtering might be generated so that the time constants measured by visual stimulation exceed the biophysical membrane time constants of cortical neurons by a factor of up to eight [53-80 compared to 10-20 msec (Worgotter and Holt 1991)]. To this end, we have implemented three candidate mechanisms of temporal low-pass filtering in a model circuit for simple cells, and have computed their responses to bars moving at different speeds. The three biologically realistic mechanisms are feedback corticocortical excitatory synapses, mutual inhibition between neurons receiving input of opposite polarity, and temporal integration by synapses with a long time course. Feedback or self-excitation generates low-pass filtering in a straightforward way: consider an analog neuron with time constant τ and activity V, receiving inputs W and V with weights b and a, from the LGN and itself, respectively. For positive values of V, the state equation can be written as

τ dV/dt = -V + aV + bW

which rescales to

[τ/(1 - a)] dV/dt = -V + [b/(1 - a)] W
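The rescaling can be checked numerically; in the sketch below (ours), a = 0.8 stretches a 10 msec time constant to 50 msec, the fivefold increase discussed below:

```python
import numpy as np

tau, a, b, dt = 10.0, 0.8, 0.1, 0.01
W, V, trace = 1.0, 0.0, []          # step input W from the LGN
for _ in range(20000):
    V += (dt / tau) * (-V + a * V + b * W)
    trace.append(V)
trace = np.asarray(trace)

steady = b * W / (1 - a)            # steady-state activity, here 0.5
t63 = dt * np.argmax(trace >= steady * (1 - np.exp(-1)))
print(t63, tau / (1 - a))           # both about 50: a fivefold slower unit
```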
Figure 3: Parameter sensitivity of the excitatory feedback loop for sustained-input cells. (A, B, and C) All VRCs are the mean VRCs of neurons a, b, and c of Figure 1B. All three neurons are driven by sustained LGN cells. The bold solid VRCs were obtained with the parameter values given in the appendix and used for the simulations of Figure 2C. The parameters changed were the absolute weight of the feedback corticocortical excitatory synapses (A), the time to peak of these corticocortical synapses (B), and the strength of the geniculocortical input (C). The effects of the time-to-peak value of the corticocortical synapses (B) and of the strength of the geniculocortical input (C) on the shape of the VRCs were computed for a range of corticocortical synaptic weights. Each VRC plotted is the one which, for the parameter value tested (see insets), showed the clearest low-pass behavior without exceeding the dynamic range of cortical neurons. The corresponding value of the corticocortical weight is indicated in parentheses in the insets. All these weights have been scaled to the values used for the VRCs in bold solid lines. (D) VRCs of cells a, b, and c (see inset) of Figure 1B. Only neuron a receives first-order LGN input; b and c are second-order neurons. The weights of the geniculocortical and the corticocortical synapses are 1.5 and 1.4 times those used in the bold solid VRCs of A, B, and C.
To achieve a 5-times increase in time constant, a must take a value of 0.8. Since the dynamic range of cortical neurons is about half that of LGN cells (Orban et al. 1985), b must be about 0.1 (for steady-state conditions). Thus as a first estimate, the ratio of intra- over subcortical excitatory input conductances should be about 8 in order to produce the desired low-pass filtering. This is precisely the value Douglas and Martin (1991) used, based on ratios of corticocortical over geniculocortical synapses in anatomical studies, in order to simulate their intracellular recordings. In fact, this value of 8 underestimates the low-pass filtering performed by the cortex itself. The time constants measured for cortical cells describe the low-pass filtering between the visual stimulus (not the LGN, which is temporally band-pass) and the cortical response. Moreover, this ratio can hardly be increased because of stability conditions. To ensure stability, Douglas and Martin (1991) recently proposed a microcircuit in which cortical amplification is controlled by inhibitory synapses. We incorporated this feature in our model of Figure 1A (now cell B and cell D inhibit all cells including themselves) and were able to reproduce our VRCs by concomitantly increasing the weight of the feedback excitation, and hence the number of excitatory corticocortical synapses. So our conclusions concerning the transformations in the spatiotemporal domain presented in this paper are not changed by this inhibitory control. Given the above assumption about the dynamic range of cortical and geniculate neurons, the ratio of the corticocortically over geniculocortically induced integrated conductances derived from the simulations of Figure 2C was about 8.1, a value close to the theoretical prediction (see Appendix). Mutual inhibition between neurons receiving input of opposite polarity causes neurons to integrate the difference between their inputs with a time constant that depends on the weight of the inhibition, while common input components are transmitted with a very short time constant (Cannon et al. 1983). This mechanism, proposed for temporal integration of velocity signals in the oculomotor centers, works only as long as the neurons’ input-output function can be linearized around both neurons’ activity values, i.e., it requires a high spontaneous activity or a high common input. Simple cells have little spontaneous activity, but some experiments indicate a high convergence of ON- and OFF- LGN input (Sillito 1975). We were able to exploit this mechanism to shift the upper cutoff velocity of simple cells toward lower speeds in a circuit of analog neurons, but failed to reproduce this result with the more detailed neuron model used in this paper. Temporal integration by synapses with a long time course has been demonstrated to be necessary for normal visual stimulation of the cortex (Miller et al. 1989). These authors found in all cortical layers, after infusion of the cortex with the NMDA-antagonist APV, a high correlation between a neuron’s excitability by visual stimuli and by locally applied NMDA.
This temporal integration (TI) can be approximated for continuous spike trains of duration T as

TI(T) = 1 - (1 + T/α) e^{-T/α}

This is a sigmoid function with maximum at TI = 1 and values TI(nα) = 1 - [(n + 1)/e^n] for integer n. The assumption of continuous inputs holds, since cortical cells are driven by at least tens of LGN cells (Martin 1988). The greater the time to peak α of the synapse, the longer the duration T of the afferent spike train must be to prevent response attenuation, and the more slowly the EPSP builds up. Synapses with long time courses can at best yield a plateau EPSP of the same amplitude as the EPSP peak in short-time synapses. Temporal integration has been suggested to occur at the level of the geniculocortical synapse (Miller et al. 1989); however, it will have a stronger low-pass effect if it occurs corticocortically: the time course of the corticocortical synapse determines the range of velocities for which the corticocortically induced EPSP coincides with the geniculocortically induced EPSP. Our measurements show that in the velocity domain, optimal time-to-peaks for corticocortical synapses, for area 17 as well as for area 18 cells, range between 8 and 16 msec. The induced EPSPs peak then between 16 and 32 msec, and these values fall within the range of NMDA-induced EPSPs (Jones and Baughman 1988). For shorter time courses, the responses at medium and high velocities are amplified as well, and the limited dynamic range of (cortical) neurons prevents a low-pass behavior. For larger time-to-peak values, the corticocortically generated EPSPs peak when the geniculocortically built up EPSPs are already fading away, except at very low speeds. However, we did not model voltage-dependent conductances and so are not able to draw further conclusions on the role of NMDA receptors in the corticocortical synapses. Since low-pass cells are rare in area 18 of the cat (Orban et al. 1981b; Movshon et al. 1978), which receives exclusively transient Y-cell input from the LGN, we also modeled transient-input simple cells. Within the parameter range eliciting low-pass behavior in sustained-input simple cells, the VRCs of transient-input simple cells became tuned to a low speed and were not low-pass. It is noteworthy that this difference in velocity properties is exclusively due to the difference in input, and not to a difference in the time constant of cortical model cells. This fits with recent observations of Martin's group who failed to observe a difference in time constants between area 17 and 18 cortical cells (K. Martin, personal communication). Although the distribution of sustained and transient cells is organized differently in the monkey, our model applies equally to the parvocellular and magnocellular systems, and to the generation of velocity low-pass cells in central areas V1 and V2 of the monkey (Orban et al. 1986). Indeed, the temporal frequency response functions of our modeled sustained and
transient geniculate cells are close to those of parvo- and magnocellular geniculate cells (Derrington and Lennie 1984; Hicks et al. 1983). Moreover, low-pass cells were observed in area V1 of the monkey mainly outside the magnocellular recipient layers (Orban et al. 1986). We conclude that our current model for temporal low-pass filtering by area 17 simple cells is both biologically plausible and computationally powerful, although contributions from other mechanisms, such as spatiotemporally oriented filters (McLean and Palmer 1989) or inhibition by high-pass cells (Duysens et al. 1985b), cannot be excluded.
Appendix

• Half-wave rectification: ⌈x⌉ = max(x, 0).

• α-function g(t) with time to peak α: g(t) = (t/α²) e^{-t/α}.

• Geniculate input:

W(t) = ⌈W_center(t) - W_surround(t) + sa⌉
W_i(t) = T_i(t) * S_i(x, y) * stimulus(x, y, t),    i = center, surround
S_i(x, y) = exp[-(x² + y²)/(2σ_i²)]
T_i(t) = K g_{i,1}(t) - L g_{i,2}(t)

where * = convolution, i = center, surround, sa = spontaneous activity contribution, K = 1, L = 0.6 (sustained) or 0.9 (transient cells), σ_center = 0.3 deg (sustained) or 0.6 deg (transient cells), σ_surround = 3 σ_center. The biphasic impulse function T_i generates rebound responses in both center and surround, and a temporal as well as velocity band-pass behavior. In the α-functions g_{i,1} and g_{i,2}: α_center,1 = 8 msec, α_surround,1 = 16 msec, α_{i,2} = 2 α_{i,1}. Spikes are generated by a Poisson process (Worgotter and Koch 1991):

P(spike, t) = p_0 Δt W(t)
The peak firing frequency of the geniculate input was about 500 spikes/sec for sustained input and about 800 spikes/sec for the transient input. Different strengths of this geniculate input were obtained (Fig. 3C) by changing the value of p_0. Δt = 0.5 or 0.25 msec.
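The spike generator itself is a one-liner; a sketch (ours), treating each time bin as an independent Bernoulli trial:

```python
import numpy as np

rng = np.random.default_rng(1)

def poisson_spikes(W, p0, dt):
    """Bernoulli-per-bin approximation of the Poisson spike generator:
    a spike occurs in bin t with probability p0 * dt * W(t)."""
    p = np.clip(p0 * dt * W, 0.0, 1.0)
    return rng.random(len(W)) < p
```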
• Cortical cells:

P(spike, t) = p Δt ⌈V(t) - E_threshold⌉ if t - t_spike > 2 msec; else P(spike, t) = 0,

where V = membrane potential in mV relative to rest, E_threshold = 15 mV, E_exc = 100 mV, E_inh = -30 mV, E_AHP = -30 mV, τ = 16 msec, and g_leak = 0.125 μS. The following conductances (with time to peak α and weight factor w between brackets) were used for Figure 2C and G: fast (α = 2 msec, w = 15) and slow (80 msec, 30) inhibition, fast-decaying (2 msec, 10) and medium-duration (32 msec, 20) AHPs in excitatory neurons, fast-decaying (1 msec, 20) AHP in inhibitory neurons, geniculocortical (1 msec, 0.8) excitation, and corticocortical (16 msec, 15) excitation.
From these parameter values and from the computed cortical cell response, the ratio of the corticocortical (CC) over the geniculocortical (GC) conductances can be derived as

[(weight CC synapse) × (number of CC synapses)] / [(weight GC synapse) × (number of GC synapses)]

with

(number of GC synapses) = (response of geniculate afferent pool) / (response of single LGN afferent)

and, using the results of Orban et al. (1985),

(response single LGN afferent) = 2 × (response single cortical cell)

For the simulations presented in Figure 2C, this yields

(15.0 × 1.0) / (0.8 × [515/(2 × 111)]) = 8.08
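The final figure can be checked directly (a few lines of Python, with the values copied from the text above):

```python
# Arithmetic check of the conductance ratio above (Figure 2C values).
weight_cc, n_cc = 15.0, 1.0      # corticocortical weight and synapse count
weight_gc = 0.8                  # geniculocortical weight
n_gc = 515 / (2 * 111)           # pool response / (2 x single cortical response)
print(weight_cc * n_cc / (weight_gc * n_gc))   # -> 8.08...
```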
Acknowledgments We are indebted to G. Meulemans for preparing the figures, to M. Van Hulle for his advice concerning numerical problems, and to T. Tollenaere and S. Raiguel for reading the manuscript. This work was supported by Grant RFO/Al/Ol from the Belgian Ministry of Science to G.A.O.
References
Baker, C. L., Jr. 1988. Spatial and temporal determinants of directionally selective velocity preference in cat striate cortex neurons. J. Neurophysiol. 59, 1557-1574.
Cannon, S. C., Robinson, D. A., and Shamma, S. 1983. A proposed neural network for the integrator of the oculomotor system. Biol. Cybern. 49, 127-136.
Connors, B. W., Gutnick, M. J., and Prince, D. A. 1982. Electrophysiological properties of neocortical neurons in vitro. J. Neurophysiol. 48, 1302-1320.
Connors, B. W., Malenka, R. C., and Silva, L. R. 1988. Two inhibitory postsynaptic potentials, and GABA_A and GABA_B receptor-mediated responses in neocortex of rat and cat. J. Physiol. (London) 406, 443-468.
Derrington, A. M., and Fuchs, A. F. 1979. Spatial and temporal properties of X and Y cells in the cat lateral geniculate nucleus. J. Physiol. (London) 293, 347-364.
Derrington, A. M., and Lennie, P. 1984. Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. J. Physiol. (London) 357, 219-240.
Douglas, R. D., and Martin, K. A. C. 1991. A functional microcircuit for cat visual cortex. J. Physiol. (London) 440, 735-769.
Duysens, J., Orban, G. A., and Cremieux, J. 1985a. Velocity selectivity in the cat visual system. II. Independence from interactions between different loci. J. Neurophysiol. 54, 1050-1067.
Duysens, J., Orban, G. A., Cremieux, J., and Maes, H. 1985b. Velocity selectivity in the cat visual system. III. Contribution of temporal factors. J. Neurophysiol. 54, 1068-1083.
Gulyas, B., Lagae, L., Eysel, U., and Orban, G. A. 1990. Corticofugal feedback influences the responses of geniculate neurons to moving stimuli. Exp. Brain Res. 79, 441-446.
Hagihara, K., Tsumoto, T., Sato, H., and Hata, Y. 1988. Actions of excitatory amino acid antagonists on geniculo-cortical transmission in the cat's visual cortex. Exp. Brain Res. 69, 407-416.
Hicks, T. P., Lee, B. B., and Vidyasagar, T. R. 1983. The responses of cells in macaque lateral geniculate nucleus to sinusoidal gratings. J. Physiol. (London) 337, 183-200.
Jack, J. J. B., Noble, D., and Tsien, R. W. 1975. Electrical Current Flow in Excitable Cells. Clarendon Press, Oxford.
Jones, K. A., and Baughman, R. W. 1988. NMDA- and non-NMDA-receptor components of excitatory synaptic potentials recorded from cells in layer V of rat visual cortex. J. Neurosci. 8, 3522-3534.
Lee, B. B., Elepfandt, A., and Virsu, V. 1981. Phase responses to sinusoidal gratings of simple cells in cat striate cortex. J. Neurophysiol. 45, 818-828.
Maex, R., and Orban, G. A. 1991. Subtraction inhibition combined with a spiking threshold accounts for cortical direction selectivity. Proc. Natl. Acad. Sci. U.S.A. 88, 3549-3553.
Martin, K. A. C. 1988. From single cells to simple circuits in the cerebral cortex. Quart. J. Exp. Physiol. 73, 637-702.
McLean, J., and Palmer, L. A. 1989. Contribution of linear spatiotemporal receptive field structure to velocity selectivity of simple cells in area 17 of cat. Vision Res. 29, 675-679.
Miller, K. D., Chapman, B., and Stryker, M. P. 1989. Visual responses in adult cat visual cortex depend on N-methyl-D-aspartate receptors. Proc. Natl. Acad. Sci. U.S.A. 86, 5183-5187.
Movshon, J. A., Thompson, I. D., and Tolhurst, D. J. 1978. Spatial and temporal contrast sensitivity of neurones in areas 17 and 18 of the cat's visual cortex. J. Physiol. (London) 283, 101-120.
Orban, G. A. 1984. Neuronal operations in the visual cortex. In Studies of Brain Function, Vol. 11, H. B. Barlow, T. H. Bullock, E. Florey, O. J. Grüsser, and A. Peters, eds. Springer-Verlag, Berlin.
Orban, G. A. 1991. Quantitative electrophysiology of visual cortical neurones. In Vision and Visual Dysfunction, Vol. 4, The Neural Basis of Visual Function, J. Cronly-Dillon, gen. ed., and A. G. Leventhal, ed. Macmillan, London.
Orban, G. A., Kennedy, H., and Maes, H. 1981a. Response to movement of neurons in areas 17 and 18 of the cat: Velocity sensitivity. J. Neurophysiol. 45, 1043-1058.
Orban, G. A., Kennedy, H., and Maes, H. 1981b. Response to movement of neurons in areas 17 and 18 of the cat: Direction selectivity. J. Neurophysiol. 45, 1059-1073.
Orban, G. A., Hoffmann, K.-P., and Duysens, J. 1985. Velocity selectivity in the cat visual system. I. Responses of LGN cells to moving bar stimuli: A comparison with cortical areas 17 and 18. J. Neurophysiol. 54, 1026-1049.
Orban, G. A., Kennedy, H., and Bullier, J. 1986. Velocity sensitivity and direction selectivity of neurons in areas V1 and V2 of the monkey: Influence of eccentricity. J. Neurophysiol. 56, 462-480.
Palmer, L. A., and Davis, T. L. 1981. Receptive-field structure in cat striate cortex. J. Neurophysiol. 46, 260-276.
Schwindt, P. C., Spain, W. J., Foehring, C. E., Stafstrom, C. E., Chubb, M. C., and Crill, W. E. 1988. Multiple potassium conductances and their functions in neurons from cat sensorimotor cortex in vitro. J. Neurophysiol. 59, 424-449.
Sillito, A. M. 1975. The contribution of inhibitory mechanisms to the receptive field properties of neurones in the striate cortex of the cat. J. Physiol. (London) 250, 305-329.
Worgotter, F., and Holt, G. 1991. Spatiotemporal mechanisms in receptive fields of visual cortical simple cells: A model. J. Neurophysiol. 65, 494-510.
Worgotter, F., and Koch, C. 1991. A detailed model of the primary visual pathway in the cat: Comparison of afferent excitatory and intracortical inhibitory connection schemes for orientation selectivity. J. Neurosci. 11, 1959-1979.
Received 18 June 1991; accepted 1 December 1991
Communicated by Stephen Gallant
A ”Thermal” Perceptron Learning Rule Marcus Frean Physiological Laboratory, Downing Street, Cambridge CB2 3EG, England
The thermal perceptron is a simple extension to Rosenblatt's perceptron learning rule for training individual linear threshold units. It finds stable weights for nonseparable problems as well as separable ones. Experiments indicate that if a good initial setting for a temperature parameter, T_0, has been found, then the thermal perceptron outperforms the Pocket algorithm and methods based on gradient descent. The learning rule stabilizes the weights (learns) over a fixed training period. For separable problems it finds separating weights much more quickly than the usual rules. 1 Introduction
This paper is about a learning rule for a linear threshold unit, often called a perceptron. This is a unit connected by variable weights to a set of inputs across which patterns occur. In the following, the ith element of an input pattern (a binary or real number) is denoted by ξ_i, and its associated weight by W_i. In response to a given pattern, a perceptron goes into one of two output states given by

o = 1 if φ > 0, and o = 0 otherwise    (1.1)

where

φ = Σ_{i=0}^{N} W_i ξ_i

The extra input ξ_0, taken to be 1 for every pattern, is commonly included as it enables the zeroth weight to act as a bias. The learning task for such units is to find weights such that the perceptron's output o matches a desired output t for each input pattern ξ. On presentation of the pth input pattern, the perceptron learning rule (Rosenblatt 1962), henceforth abbreviated to PLR, alters the weights in the following way:

PLR: Δ^p W_i = α (t^p - o^p) ξ_i^p

Neural Computation 4, 946-957 (1992)
@ 1992 Massachusetts Institute of Technology
where α is a positive constant. (Note that if weights are initially all zero, then the magnitude of α is irrelevant because of the threshold function.) For instance, if the output o elicited by a given input pattern is 0 whereas t is 1, this rule alters weights to increase the value of φ. If repeated, this would eventually result in the output becoming 1. The perceptron convergence theorem (Block et al. 1962; Minsky and Papert 1969) states that if a set of weights exists for which the perceptron makes no errors, the PLR will converge on a (possibly different) set making no errors after only a finite number of pattern presentations. Hence perceptrons can learn any classification for which appropriate weights exist: such training sets are termed "linearly separable." A far more common, and in many ways more interesting, task is to find a set of weights that gives a minimal number of errors in the case where perfect weights do not exist. Clearly if the PLR is used in such cases, the weights are never stable, since they change every time an error is made. Neither are they "good on average," as will be seen in Section 4. The task of finding stable weights that engender a small number of errors in this "nonseparable" case is addressed in this paper. 2 The Thermal Perceptron Rule
One rationale goes as follows: the trouble is that the PLR does the same thing for every error made. Instead, the benefit from improving the value of φ elicited by a given input pattern should be tempered by the possibility that the new weights now misclassify patterns they previously got right. Since the change in φ due to the effect of the PLR is independent of the value itself (other than its sign), an error with a large associated |φ| is less likely to be corrected in this step than an error where |φ| is small. Conversely, the weight changes that would be necessary to correct a large error are themselves large, and hence much more likely to corrupt the existing correct responses of the unit. Ideally, the weight changes made should be biased toward correcting errors for which φ is close to zero. A simple way to do this is to make the weight changes given by Rosenblatt's PLR tail off exponentially with |φ|:

Δ^p W_i = α (t^p - o^p) ξ_i^p e^{-|φ^p|/T}
The parameter T controls how strongly the changes are attenuated for large |φ|. T is somewhat analogous to a temperature: at high T the PLR is recovered, since the exponential becomes almost unity for any input, whereas at low T there are no appreciable changes unless φ is close to zero. If T is very small the weights are frozen at their current values. A natural extension is to anneal this effect by gradually reducing the temperature from high T, where the usual PLR behavior is seen, toward
zero. This gradual freezing is particularly desirable because it stabilizes the weights in a natural way over a finite learning period. If T is held constant, convergence of the thermal PLR can be deduced (Frean 1990b) from the perceptron convergence theorem. That is, Minsky and Papert's proof can be modified to hold where weight changes have the correct sign and use a positive step size, α_t, chosen arbitrarily between bounds 0 < a ≤ α_t ≤ b. One picture of the way this rule works is given by considering the patterns as points in a space in which each dimension corresponds to the activity of a particular input line. Points in this space at which the perceptron's output changes define a decision surface that (from equation 1.1) is a hyperplane perpendicular to the weights vector in this space. In the usual PLR, this hyperplane moves by approximately the same amount whenever there is an error, whereas in the thermal PLR it moves by an amount that is appreciable only if the pattern causing the error is close to the hyperplane (i.e., where |φ| is small). As an approximation one can imagine a zone immediately to either side of the hyperplane within which an error will cause it to move. The perceptron will be stable if there are no errors occurring in this zone. The annealing of the temperature is then the gradual reduction of the extent of the "sensitive" zone. In the limit of T → 0 the zone disappears altogether and the perceptron is stable. As will be seen, best results are obtained if α is reduced from 1 to 0 at the same time as T is. In terms of the simplified picture above, this means that we reduce the amount the hyperplane moves in reaction to a pattern within the zone near the hyperplane in proportion to the size of the zone.
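A compact sketch of the annealed rule (our own illustration, not the author's code; the linear schedules for T and α, the starting temperature T0, and random presentation with replacement follow the description given in Section 4):

```python
import numpy as np

def thermal_perceptron(patterns, targets, T0, epochs=100, seed=0):
    """Thermal PLR: Rosenblatt updates attenuated by exp(-|phi|/T),
    with both T and the step size alpha annealed linearly to zero.
    patterns is a (P, n) array whose first column is the bias input xi_0 = 1."""
    rng = np.random.default_rng(seed)
    P, n = patterns.shape
    W = np.zeros(n)
    total = epochs * P
    for step in range(total):
        frac = 1.0 - step / total          # anneals from 1 down to 0
        T, alpha = T0 * frac, frac
        mu = rng.integers(P)
        xi, t = patterns[mu], targets[mu]
        phi = W @ xi
        o = 1 if phi > 0 else 0
        if o != t and T > 0:
            W += alpha * (t - o) * xi * np.exp(-abs(phi) / T)
    return W
```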
3 Other Methods

The two established ways of learning the weights for a threshold perceptron are reviewed here. In the next section the thermal rule is compared with these methods.

3.1 The Pocket Algorithm. A simple extension of perceptron learning for nonseparable problems called the Pocket algorithm (Gallant 1986a) suffices to make the PLR well behaved, in the sense that weights that minimize the number of errors can be found. This method is currently the preferred one for learning weights in threshold perceptrons, and is used in a number of applications. The Pocket algorithm consists of applying the PLR with a randomly ordered presentation of patterns, but also keeping a copy of a second set of weights "in your pocket." If the patterns are presented in random order, good sets of weights have a lower probability of being altered at each presentation, and therefore tend to remain unchanged for a longer time than sets of weights that engender
”Thermal” Perceptron Learning Rule
949
more errors. Whenever a set of perceptron weights lasts longer than the current longest run, they are copied into the pocket. As the training time increases, the pocketed weights give the minimum possible number of errors with a probability approaching unity. That is, if a solution giving say p or fewer errors exists then the Pocket algorithm can, in theory, be used to find it. Note that if the patterns are linearly separable, the algorithm reduces to the usual PLR. Gallant (1986a) notes that the Pocket algorithm outperforms a standard technique [Wilks’ method, in (SPSS-X 1984)l by about 20% for a set of 15 learning problems. In this form the algorithm improves the pocketed weights in an entirely stochastic fashion-there is nothing to prevent a good set of weights being overwritten by an occasional long run of successes involving bad weights. The ”ratchet” version of the Pocket algorithm (Gallant 1990) ensures that every time the pocket weights change, they actually improve the number of errors made. Whenever a set of weights is about to be copied into the pocket, the actual number of errors these weights engender if every pattern in the training set were to be applied to the input is compared with the same number for the existing pocket weights. The weights in the pocket are replaced only if the actual number of errors is lower. The main strength of the Pocket algorithm is the fact that optimal weights are found with probability approaching unity, given sufficient training time. Unfortunately there is no bound known for the training time actually required to achieve a given level of performance. Moreover, in practice the weights do not improve much beyond the first few cycles through the training set. Although the weights so obtained are better than many other methods, they still tend to make many more errors than an optimal set would make. To get better weights from the same procedure the “ratchet” is required. A potential disadvantage of this is that the computational cost is greatly increased in cases where there are a lot of weight vectors that are almost as good as those in the pocket (i.e., where the ratchet is likely to be invoked frequently). 3.2 Methods Based on Gradient Descent. Another way to generate weights for a threshold perceptron is to learn them by a gradient descent procedure. To do this requires adopting some differentiable function such as the ”sigmoid” or “logistic” function, which approximates the perceptron’s output function:
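For comparison, here is a hedged sketch of the Pocket algorithm with the ratchet check as an option. The structure follows the description above; the function name and the bookkeeping details are assumptions of this sketch rather than Gallant's published code.

```python
import numpy as np

def pocket_algorithm(patterns, targets, epochs=100, ratchet=False, rng=None):
    """Sketch of the Pocket algorithm, optionally with the ratchet check."""
    rng = np.random.default_rng() if rng is None else rng
    P, N = patterns.shape
    w = np.zeros(N)
    pocket_w, pocket_run = w.copy(), 0
    pocket_errors = P + 1                 # worse than any real error count
    run = 0                               # current run of consecutive correct responses
    for _ in range(epochs * P):
        p = rng.integers(P)               # random presentation order
        xi, t = patterns[p], targets[p]
        o = 1 if w @ xi > 0 else 0
        if o == t:
            run += 1
            if run > pocket_run:
                if ratchet:
                    # copy only if the candidate actually makes fewer errors
                    # over the whole training set than the pocketed weights
                    errors = int(np.sum((patterns @ w > 0).astype(int) != targets))
                    if errors < pocket_errors:
                        pocket_w, pocket_run, pocket_errors = w.copy(), run, errors
                else:
                    pocket_w, pocket_run = w.copy(), run
        else:
            w = w + (t - o) * xi          # ordinary PLR step
            run = 0
    return pocket_w
```

The full sweep over the training set inside the ratchet branch makes visible why the ratchet can be expensive when many near-optimal weight vectors keep triggering it.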
3.2 Methods Based on Gradient Descent. Another way to generate weights for a threshold perceptron is to learn them by a gradient descent procedure. Doing this requires adopting some differentiable function, such as the "sigmoid" or "logistic" function, which approximates the perceptron's output function:

y^p = 1 / (1 + exp(-φ^p))

Gradient descent attempts to minimize the total error attributable to the unit:

E = Σ_p E^p

where E^p is some measure of how well y^p matches the desired output t^p for the pth input pattern. This error is reduced by moving down an estimate of the gradient of the error (namely that of E^p rather than E) with respect to W:

ΔW_i = -α ∂E^p / ∂W_i

This approximates true gradient descent of E provided α is very small (typically 0.001). One common form used for E^p is the squared error

E^p = (1/2) (t^p - y^p)²

from classical statistics, giving rise to the delta rule or "least-mean-squared" (LMS) methods (Widrow and Hoff 1960). This gives the following rule:

LMS: ΔW_i = α (t^p - y^p) y^p (1 - y^p) ξ_i^p

Another form is the cross-entropy error measure

E^p = -[t^p log y^p + (1 - t^p) log(1 - y^p)]

from information theory. Gradient descent of this measure gives rise to a different rule (Hinton 1989):

Cross-entropy: ΔW_i = α (t^p - y^p) ξ_i^p

which is just the PLR except that the sigmoid output y^p replaces the threshold output o^p.
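The two per-pattern updates can be written compactly. The following sketch (our own helper names, with the step size α as a keyword argument) shows how the sigmoid derivative y(1 - y) appears in the LMS rule but cancels in the cross-entropy rule:

```python
import numpy as np

def sigmoid(phi):
    return 1.0 / (1.0 + np.exp(-phi))

def lms_step(w, xi, t, alpha=0.001):
    """Delta-rule step: gradient of the squared error (t - y)^2 / 2."""
    y = sigmoid(w @ xi)
    return w + alpha * (t - y) * y * (1.0 - y) * xi

def cross_entropy_step(w, xi, t, alpha=0.001):
    """Cross-entropy step: the sigmoid derivative cancels, leaving (t - y) * xi."""
    y = sigmoid(w @ xi)
    return w + alpha * (t - y) * xi
```

Unlike the thermal PLR and Pocket algorithm, these rules update the weights on every pattern presentation, whether or not the thresholded output is in error.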
4 Simulations

Simulations were performed comparing the thermal PLR, Pocket algorithm, LMS, and cross-entropy on four classification tasks. The training sets in these tasks range from highly nonseparable to linearly separable. For the thermal rule, both T and α are reduced linearly from their starting values (T_0 and 1, respectively) to zero over the stated training time, measured in epochs. One epoch consists of the presentation of P patterns chosen at random (with replacement) from a training set of size P. The simulations also show the other three combinations of annealing T and α. For the purposes of comparison, the optimal value of α for LMS and cross-entropy was found by exhaustive search for each task; this optimal value is used to compare these methods with the thermal rule. A notable advantage of the Pocket algorithm is that it requires no such tuning of parameters. For a given type of problem, finding a good value of T_0 for the thermal rule can take considerable time. The same is true of α in the LMS and cross-entropy rules, although the exact choice is typically less crucial for these rules. It is important to note that the improvements in performance are obtained at the cost of this preliminary parameter search.
Regarding actual computational speeds, all the algorithms discussed here have similar run times except for the ratchet version of the Pocket algorithm. The Pocket algorithm has the advantage that it requires only integer operations, whereas the other methods must calculate exponentials in order to alter the weights; on the other hand it must occasionally copy weights to the pocket as well as change those actually in use. In these simulations, run times² of the thermal rule and the Pocket algorithm (without ratchet) never differ by more than 10%. Also note that the Pocket algorithm and thermal PLR calculate and make weight changes only if there is an error, whereas the rules based on gradient descent do this for every pattern presentation. This effect shows up particularly for "easy" problems, where the Pocket algorithm and thermal rule alter their weights relatively infrequently. At most this resulted in a 50% slowing of the gradient descent rules relative to the others, for the problems reported here.

4.1 Randomly Targeted Binary Patterns. In this task all 1024 of the binary patterns across 10 inputs are used, and each is assigned a target 0 or 1 with equal probability. Clearly this problem is highly nonseparable. Figures 1 and 2 show the performance of the various rules on this problem versus the initial temperature, T_0, used in the thermal PLR. Each trial consists of generating a random training set and, starting with zero weights, training a perceptron for 100 epochs. The performance is simply the proportion of patterns classified correctly. For the gradient descent methods the optimal value of α for this problem and this training period was found by exhaustive search. The ratchet version of the Pocket algorithm takes on average 6 times longer for this problem. Standard deviations are of the order of 0.02 for all points and all algorithms. Note that the PLR corresponds to the "no annealing" curve at high T_0, which shows poor performance.

4.2 Discriminating Two Overlapping Gaussian Clusters. In this experiment the training sets consist of 100 real-valued patterns across 10 inputs. To generate a training set, two points inside the unit hypercube were chosen at random and rescaled to be a unit distance apart. These were taken to be the centers of two gaussian probability distributions of variance 2. Fifty patterns whose target output is 1 were generated at random from the first distribution, and 50 patterns with target 0 from the second. For LMS and cross-entropy the optimal value of α was found to be 0.5 and 0.001, respectively. Figure 3 shows the performance of each method on this task. Both LMS and the thermal PLR do much better than the Pocket algorithm. The ratchet slows the Pocket algorithm down by a factor of three for this problem but significantly improves performance.

²Note however that some implementations may exploit the advantages of integer arithmetic available to the Pocket algorithm to a greater extent than the one used here.
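For concreteness, the two training-set generators described in Sections 4.1 and 4.2 might be sketched as follows. The helper names are hypothetical, and we interpret "variance 2" as the per-coordinate variance of each gaussian (standard deviation √2); this is a sketch under those assumptions, not the author's code.

```python
import numpy as np
from itertools import product

def random_targets_task(rng):
    """All 1024 binary patterns over 10 inputs, each given a random 0/1 target."""
    patterns = np.array(list(product([0, 1], repeat=10)), dtype=float)
    targets = rng.integers(0, 2, size=len(patterns))
    return patterns, targets

def two_gaussians_task(rng, n_inputs=10, per_class=50):
    """Two gaussian clusters whose centers are rescaled to unit separation."""
    c1, c2 = rng.random((2, n_inputs))              # two random points in the unit hypercube
    c2 = c1 + (c2 - c1) / np.linalg.norm(c2 - c1)   # rescale the centers to unit distance apart
    sd = np.sqrt(2.0)                               # "variance 2", read per coordinate
    x1 = c1 + sd * rng.standard_normal((per_class, n_inputs))
    x0 = c2 + sd * rng.standard_normal((per_class, n_inputs))
    patterns = np.vstack([x1, x0])
    targets = np.r_[np.ones(per_class, int), np.zeros(per_class, int)]
    return patterns, targets
```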
Figure 1: Performance on the random targets problem versus starting temperature. Half the patterns are target 1, so a unit with zero valued weights gets 50% of the patterns correct by default. Each point is the mean of 1000 independent trials, each on a different training set. The levels attained by gradient descent methods and the Pocket algorithm are indicated by the shaded regions. The two gradient descent rules achieve virtually the same performance.
4.3 Convergence Time for Separable Training Sets. Where the training set is linearly separable, the question is how quickly a set of separating weights can be found. Again all 1024 binary patterns across 10 inputs were included. The targets are then defined by inventing a perceptron with random weights and adjusting its bias so that exactly half the total number of patterns are ON. Figure 4 shows how many trials out of 100 converged, where each algorithm was run for the indicated number of iterations. T_0 for the thermal rule is set at 1.5 here, but values very much higher (e.g., T_0 = 100) are equally good. The value of α used for LMS and cross-entropy is 0.1, which is the optimal value for LMS. If α is very much larger than this, the cross-entropy rule effectively approaches the PLR: the weights are correspondingly large, so the sigmoid is almost always in its "saturated" range, mimicking a threshold unit. If there is no annealing, the thermal PLR and Rosenblatt's PLR (and hence the Pocket algorithm) have very similar performance. Annealing results in a striking improvement (an order of magnitude) in the speed of convergence over the other methods.
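One way to realize this construction is sketched below: a random "teacher" perceptron labels all 1024 patterns, with its bias set at the median net input so that exactly half the patterns come out ON. The helper name and the median trick are our own devices, assumed for illustration.

```python
import numpy as np
from itertools import product

def separable_targets_task(rng):
    """Targets from a random teacher perceptron, biased so half the patterns are ON."""
    patterns = np.array(list(product([0, 1], repeat=10)), dtype=float)
    w = rng.standard_normal(10)
    phi = patterns @ w
    bias = -np.median(phi)      # thresholding at the median splits 1024 patterns 512/512
    targets = (phi + bias > 0).astype(int)   # ties at the median are improbable here
    return patterns, targets
```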
"Thermal" Perceptron Learning Rule
W
26 0.58 % 0 .s 0.56 C
953
thermal PLR
0 anneal alpha only 0
A
anneal T only no annealing
.-
2
u
0 0.54 C
0 .r
8 0.52
e
a 0.50
0.0
0.5
1.o
1.5
2.0
2.5
To, starting temperature
Figure 2: Performance on the random targets problem, for the various combinations of annealing, versus starting temperature.
4.4 Nearly Separable Training Sets: Dealing with Outliers. In many problems of interest, the patterns would be linearly separable were it not for a small number of outliers that prevent the PLR from converging. In such cases we want to ignore the outliers and find the weights that would correctly classify the linearly separable patterns. Training sets for this example were generated in the following way: first, a linearly separable training set is produced as above. Then some small number n of the patterns are chosen at random and their targets flipped from 1 to 0 or vice versa.³ Hence the problem would be linearly separable were it not for these patterns, and a good learning rule should produce units giving at most n errors. Parameter values are the same as in the separable case above. Results are shown in Figure 5. Inclusion of the ratchet slows the Pocket algorithm by an average factor of 300 (for n = 1), indicating that a great many long runs with suboptimal weights occur. Note that the thermal rule performs slightly better than the level attainable by ignoring the flipped patterns, indicating that it occasionally finds a better set of weights than those originally used to generate the patterns. Although T_0 is 1.5 here, virtually the same performance is obtained for starting temperatures up to T_0 = 30.

³If n is small the problem may still be linearly separable; such cases were rejected.
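A sketch of the corresponding target-flipping step is given below. The function name is hypothetical, and the rejection of sets that remain linearly separable (footnote 3) is left to the caller.

```python
import numpy as np

def flip_outliers(targets, n, rng):
    """Flip the targets of n randomly chosen patterns to create outliers."""
    flipped = targets.copy()
    idx = rng.choice(len(targets), size=n, replace=False)
    flipped[idx] = 1 - flipped[idx]     # 0 -> 1 and 1 -> 0
    return flipped
```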
Figure 3: Performance on the overlapping gaussians problem versus starting temperature. Performance is the proportion of the training set correctly classified after 100 epochs. Each point is an average over 1000 trials. Standard deviations are of the order of 0.04 for all points. The mean performance of the PLR on this problem is 0.588.
5 Discussion

The thermal PLR, which is closely related to the classical PLR, outperforms both the Pocket algorithm and methods based on gradient descent in terms of efficiently generating weights that give a small number of errors in a threshold perceptron. If the patterns are linearly separable, perfect weights are located quickly. In addition it produces stable weights in a given training time. It is interesting to compare the rationale behind the thermal rule with that of the LMS procedures. In the case of the thermal rule, it is argued
"Thermal" Perceptron Learning Rule
-& P C 8 -2
955
100
90
0
thermal PLR thermal PLR, no annealing
70
A
.PI R
-
60
0 LMS A Cross-entropy
5
40
5
30
80
-
Y
Y
h
o
2 0
50
v
a
z
20
10 0
0
200
400
600
800
1000
Training time in epochs
Figure 4: The graph shows how many out of 100 independent trials (each on a different training set) converged to zero errors after the training time denoted by the abscissa. For example, after 200 epochs 60% of runs using the PLR had converged.
that large errors (in φ) should be penalized lightly, since endeavoring to correct these errors means corrupting the existing weights to a large degree. In LMS, large errors are penalized more heavily than small ones, since they contribute proportionally more to the quantity being minimized (the sum of squared errors). Hence these two approaches have opposite motivations. The twin constraints of stability and optimality of weights are of particular significance for constructive algorithms, in which perceptrons are added one at a time, enabling eventual convergence of the whole network to zero errors. At present the Pocket algorithm is the procedure of choice for most methods of this type (e.g., Gallant 1986b; Mezard and Nadal 1989; Nadal 1989; Golea and Marchand 1990). The thermal rule has been applied to this type of algorithm (Frean 1990a; Burgess et al. 1991; Martinez and Esteve 1992), and can dramatically reduce the size of the networks so produced (by between 2 and 5 times for a range of problems), resulting in better generalization and computational efficiency (Frean 1990b). In addition, more difficult problems may now be successfully tackled with the same constructive algorithms.
Figure 5: The number of errors made after 100 epochs is plotted against the number of patterns whose targets were flipped from an initially separable training set. For instance, if a single target is flipped the thermal PLR always makes one error whereas the other methods make 8 errors. In the shaded region the number of errors is less than the number of outliers. Each point is the average of 200 trials, each on a different training set. Where not shown the error bars are smaller than the points.
Acknowledgments The author would like to thank Peter Dayan, David Willshaw, and Jay Buckingham for their helpful advice in the preparation of this paper.
References

Block, H. D., Knight, B. W. Jr., and Rosenblatt, F. 1962. Analysis of a four-layer series-coupled perceptron (II). Rev. Modern Phys. 34(1), 135-142.
Burgess, N., Granieri, M. N., and Patarnello, S. 1991. 3-D object classification: Application of a constructor algorithm. Int. J. Neural Syst. 2(4), 275-282.
Frean, M. R. 1990a. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Comp. 2(2), 198-209.
Frean, M. R. 1990b. Small nets and short paths: Optimising neural computation. Ph.D. thesis, University of Edinburgh.
Gallant, S. I. 1986a. Optimal linear discriminants. IEEE Proc. 8th Conf. Pattern Recognition, Paris.
Gallant, S. I. 1986b. Three constructive algorithms for network learning. Proc. 8th Annual Conf. Cognitive Sci. Soc.
Gallant, S. I. 1990. Perceptron-based learning algorithms. IEEE Transact. Neural Networks 1(2), 179-192.
Golea, M., and Marchand, M. 1990. A growth algorithm for neural network decision trees. Europhys. Lett. 12(3), 205-210.
Hinton, G. E. 1989. Connectionist learning procedures. Artificial Intelligence 40, 185-234.
Martinez, D., and Esteve, D. 1992. The offset algorithm: Building and learning method for multilayer neural networks. Europhys. Lett. 18(2), 95-100.
Mezard, M., and Nadal, J-P. 1989. Learning in feedforward layered networks: The tiling algorithm. J. Phys. A 22(12), 2191-2203.
Minsky, M., and Papert, S. 1969. Perceptrons. The MIT Press, Cambridge, MA.
Nadal, J-P. 1989. Study of a growth algorithm for neural networks. Int. J. Neural Syst. 1(1), 55-59.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Widrow, B., and Hoff, M. E. 1960. Adaptive switching circuits. IRE WESCON Convention Record, New York: IRE, 96-104.
Received 12 August 1991; accepted 24 April 1992
Acknowledgment to Reviewers

The editors of Neural Computation decide on the suitability of manuscripts for publication and help authors improve the presentation of accepted papers. In addition to our editors, many colleagues have helped us with their advice and reviews over the first four years of publication. We are very grateful for their support.

Yaser Abu-Mostafa, California Institute of Technology
Paul Adams, State University of New York
Ted Adelson, Massachusetts Institute of Technology
John Allman, California Institute of Technology
Joseph Atick, The Institute for Advanced Study, Princeton, NJ
Pierre Baldi, California Institute of Technology
William Baxt, University of California, San Diego Medical Center
Susan Becker, University of Toronto, Canada
Richard Belew, University of California, San Diego
A. B. Bonds, Vanderbilt University
Hervé Bourlard, International Computer Science Institute, Berkeley, CA
James Bower, California Institute of Technology
Oliver Braddick, University of Cambridge, England
Peter Brahm, Merton College, England
Thomas Brown, Yale University
James Buchanan, Marquette University
Heinrich Bülthoff, Brown University
Theodore Bullock, University of California, San Diego
Paul Bush, The Salk Institute
Jack Byrne, University of Texas Medical School
Garrison Cottrell, University of California, San Diego
Antonio Damasio, University of Iowa Hospitals and Clinics
Peter Dayan, The Salk Institute
John Denker, AT&T Bell Laboratories, Holmdel, NJ
Nelson Donegan, Yale University
Martin Egelhaaf, Max-Planck Institut für biologische Kybernetik, Germany
Bard Ermentrout, National Institutes of Health
Jack Feldman, University of California, Los Angeles
David Field, Cornell University
Merrick Furst, Carnegie Mellon University
Henrietta Galiana, McGill University, Canada
Stephen Gallant, HNC, Inc., Cambridge, MA
Bard Geesaman, Massachusetts Institute of Technology
Peter Getting, University of Iowa
C. Lee Giles, NEC Research Institute, Princeton, NJ
Leon Glass, McGill University, Canada
Rodney Goodman, California Institute of Technology
Richard Granger, University of California, Irvine
Paul Grobstein, Bryn Mawr College
Stephen Hanson, Siemens Research Center, Princeton, NJ
David Haussler, University of California, Santa Cruz
Simon Haykin, McMaster University, Canada
Robert Hecht-Nielsen, HNC, and University of California, San Diego
Walter Heiligenberg, University of California, San Diego
Robert Hummel, New York University
J. Stephen Judd, Siemens Research Center, Princeton, NJ
James Keeler, Microelectronics and Computer Technology Corporation, Austin, TX
Peter Kennedy, University of California, Berkeley
Nancy Kopell, Boston University
Paul Kube, University of California, San Diego
Vera Kurkova, Czechoslovak Academy of Sciences, Prague, Czechoslovakia
Alan Lapedes, Los Alamos National Laboratory
Yann LeCun, AT&T Bell Laboratories, Holmdel, NJ
Todd Leen, Oregon Graduate Institute
Sidney Lehky, University of Rochester
Ralph Linsker, IBM Research Center, Yorktown Heights, NY
Shawn Lockery, The Salk Institute
David Lowe, Royal Signals and Radar Establishment, England
Stephen Luttrell, Royal Signals and Radar Establishment, England
William Lytton, The Salk Institute
David MacKay, Cavendish Laboratory, Radio Astronomy, England
Donald MacLeod, University of California, San Diego
Michelle Mahowald, California Institute of Technology
Olvi L. Mangasarian, University of Wisconsin
Eve Marder, Brandeis University
Bruce McNaughton, University of Arizona
Bartlett Mel, California Institute of Technology
Kenneth Miller, University of California, San Francisco
Douglas Miller, McGill University, Canada
K. S. Narendra, Yale University
Ernst Niebur, California Institute of Technology
Steven Nowlan, The Salk Institute
Klaus Obermayer, University of Illinois at Urbana-Champaign
Günther Palm, Universität Ulm, Germany
Ramamohan Paturi, University of California, San Diego
J. D. Pearson, David Sarnoff Research Center, Princeton, New Jersey
Carsten Peterson, University of Lund, Sweden
John Platt, Synaptics, San Jose, California
Mark Plutowski, University of California, San Diego
Edward Posner, California Institute of Technology
Alexandre Pouget, The Salk Institute
Ning Qian, Massachusetts Institute of Technology
Ronald Rivest, Massachusetts Institute of Technology
Charles Rosenberg, Hebrew University, Israel
Peter Rowat, University of California, San Diego
Eduard Saeckinger, AT&T Bell Laboratories, Holmdel, NJ
Walter Savitch, University of California, San Diego
Thomas Schillen, Max-Planck Institut für Hirnforschung, Germany
Jude Shavlik, University of Wisconsin, Madison
Patrice Simard, AT&T Bell Laboratories, Holmdel, NJ
Paul Smolensky, University of Colorado, Boulder
Sara Solla, AT&T Bell Laboratories, Holmdel, NJ
Charles Stevens, The Salk Institute
Steve Suddarth, AFOSR/NM, Washington, D.C.
Richard Sutton, GTE Labs, Waltham, MA
David Tank, AT&T Bell Laboratories, Murray Hill, NJ
Raoul Tawel, Jet Propulsion Laboratory, Pasadena, California
Stefan Treue, Massachusetts Institute of Technology
K. P. Unnikrishnan, General Motors Research Laboratories, Warren, MI
Santosh Venkatesh, University of Pennsylvania, Philadelphia
Paul Viola, Massachusetts Institute of Technology
Alex Waibel, Carnegie Mellon University
Andreas Weigend, Xerox Palo Alto Research Center
Frank Werblin, University of California, Berkeley
Steven Whitehead, University of Rochester
Ronald Williams, Northeastern University
Martin Woldorff, University of California, San Diego
Alan Yuille, Harvard University
Anthony Zador, Yale University
David Zipser, University of California, San Diego
Index

Volume 4, By Author

Alspector, J., Zeppenfeld, T., and Luna, S. A Volatility Measure for Annealing in Feedback Neural Networks (Note)
4(2):191-195
Amari, S., Fujita, N., and Shinomoto, S. Four Types of Learning Curves (Letter)
4(4):605-618
Anthony, D. M., Hines, E. L., Hutchins, D. A., and Mottram, J. T. Ultrasound Tomography Imaging of Defects Using Neural Networks (Letter)
4(5):758-771
Atick, J. J. and Redlich, A. N. What Does the Retina Know about Natural Scenes? (Letter)
4(2):196-210
Atick, J. J., Li, Z., and Redlich, A. N. Understanding Retinal Color Coding from First Principles (Letter)
4(4):559-572
Back, A. D. and Tsoi, A. C. An Adaptive Lattice Architecture for Dynamic Multilayer Perceptrons (Letter)
4(6):922-931
Battiti, T. First- and Second-Order Methods for Learning: Between Steepest Descent and Newton’s Method (Review)
4(2):141-166
Baxt, W. Improving the Accuracy of an Artificial Neural Network Using Multiple Differently Trained Networks (Letter)
4(5):772-780
Beer, R. D., Chiel, H. J., Quinn, R. D., Espenschied, K. S., and Larsson, P. A Distributed Neural Network Architecture for Hexapod Robot Locomotion (Letter)
4(3):356-365
Behrmann, M. - See Mozer, M. C. Bialek, W. - See Ruderman, D. L. Bienenstock, E. - See Geman, S.
Bilbro, G. L. and Van den Bout, D. E. Maximum Entropy and Learning Theory (Letter)
4(6):839-853
Bishop, C. Exact Calculation of the Hessian Matrix for the Multilayer Perceptron (Note)
4(4):494-501
Bottou, L. and Vapnik, V. Local Learning Algorithms (Letter)
4(6):888-900
Bourlard, H. - See Morgan, N. Bradski, G., Carpenter, G. A., and Grossberg, S. Working Memory Networks for Learning Temporal Order with Application to Three-Dimensional Visual Object Recognition (Letter)
4(2):279-286
Brown, V. - See LaBerge, D. Bülthoff, H. H. - See Kersten, D. Carpenter, G. - See Bradski, G. Carter, M. - See LaBerge, D. Chen, D. - See Giles, C. L. Chen, H. H. - See Giles, C. L. Chiel, H. J. - See Beer, R. D. Cohn, D. and Tesauro, G. How Tight Are the Vapnik-Chervonenkis Bounds? (Letter)
4(2):249-269
D'Autrechy, C. L. - See Reggia, J. A. Daw, N. - See Fox, K. Doursat, R. - See Geman, S. Espenschied, K. S. - See Beer, R. D. Finkel, L. H. and Sajda, P. Object Discrimination Based on Depth-from-Occlusion (Letter) Fox, K. and Daw, N. A Model for the Action of NMDA Conductances in the Visual Cortex (Article) Frankel, P. - See Rinzel, J.
4(6):901-921
4(1):59-83
Frasconi, P., Gori, M., and Soda, G. Local Feedback Multilayered Networks (Letter)
4(1):120-130
Frean, M. A "Thermal" Perceptron Learning Rule (Letter)
4(6):946-957
Fujita, N. - See Amari, S. Geman, S., Bienenstock, E., and Doursat, R. Neural Networks and the Bias/Variance Dilemma (View)
4(1):1-58
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., and Lee, Y. C. Learning and Extracting Finite State Automata with Second-Order Recurrent Neural Networks (Letter)
4(3):393-405
Gislén, L., Peterson, C., and Söderberg, B. Rotor Neurons: Basic Formalism and Dynamics (Letter)
4(5):737-745
Gislén, L., Peterson, C., and Söderberg, B. Complex Scheduling with Potts Neural Networks (Article)
4(6):805-831
Glass, L. - See Lewis, J. E. Goodman, R. M., Smyth, P., Higgins, C. M., and Miller, J. W. Rule-Based Neural Networks for Classification and Probability Estimation (Article)
4(6):781-804
Gori, M. - See Frasconi, P. Grossberg, S. - See Bradski, G.
Higgins, C. M. - See Goodman, R. M. Hines, E. L. - See Anthony, D. M. Hinton, G. E. - See Nowlan, S. J. Hopfield, J. J. - See Unnikrishnan, K. P. Huberman, B. - See Lumer, E. D. Hutchins, D. A. - See Anthony, D. M. Intrator, N. Feature Extraction Using an Unsupervised Neural Network (Letter)
4(1):98-107
Janosch, B. - See König, P. Jenkins, R. E. and Yuhas, B. P. A Simplified Neural-Network Solution through Problem Decomposition: The Case of the Truck Backer-Upper (Letter)
4(5):647-649
Kabashima, Y. and Shinomoto, S. Learning Curves for Error Minimum and Maximum Likelihood Algorithms (Letter)
4(5):712-719
Kabrisky, M. - See Suter, B. Kersten, D., Bülthoff, H. H., Schwartz, B. L., and Kurtz, K. J. Interaction between Transparency and Structure from Motion (Letter)
4(4):573-589
Koch, C. - See Softky, W. P. Koch, C. - See Wörgötter, F. Koch, C. and Schuster, H. A Simple Network Showing Burst Synchronization without Frequency Locking (Letter)
4(2):211-223
König, P., Janosch, B., and Schillen, T. B. Stimulus-Dependent Assembly Formation of Oscillatory Responses: III. Learning (Letter)
4(5):666-681
Kuhn, G. M. - See Watrous, R. L. Kurtz, K. J. - See Kersten, D. LaBerge, D., Carter, M., and Brown, V. A Network Simulation of Thalamic Circuit Operations in Selective Attention (Letter)
4(3):318-331
Larsson, P. - See Beer, R. D. Lee, Y. C. - See Giles, C. L. Lenz, R. and Österberg, M. Computing the Karhunen-Loève Expansion with a Parallel, Unsupervised Filter System (Letter)
4(3):382-392
Lewis, J. and Glass, L. Nonlinear Dynamics and Symbolic Dynamics of Neural Networks (Article)
4(5):621-642
Li, Z. - See Atick, J. J. Linsker, R. Local Synaptic Learning Rules Suffice to Maximize Mutual Information in a Linear Network (Letter)
4(5):691-702
Lumer, E. D. and Huberman, B. A. Binding Hierarchies: A Basis for Dynamic Perceptual Grouping (Letter)
4(3):341-355
Luna, S. - See Alspector, J. Lyuu, Y. D. and Rivin, I. Tight Bounds on Transition to Perfect Generalization in Perceptrons (Letter)
4(6):854-862
MacKay, D. Bayesian Interpolation (Letter)
4(3):415-447
MacKay, D. A Practical Bayesian Framework for Backpropagation Networks (Letter)
4(3):448-472
MacKay, D. Information-Based Objective Functions for Active Data Selection (Letter)
4(4):590-604
MacKay, D. The Evidence Framework Applied to Classification Networks (Letter)
4(5):720-736
Maex, R. and Orban, G. A. A Model Circuit for Cortical Temporal Low-Pass Filtering (Letter)
4(6):932-945
Mel, B. NMDA-Based Pattern Discrimination in a Modeled Cortical Neuron (Letter)
4(4):502-517
Miller, C. B. - See Giles, C. L. Miller, D. A. and Zucker, S. W. Efficient Simplex-Like Methods for Equilibria of Nonsymmetric Analog Networks (Article)
4(2):167-190
Miller, J. W. - See Goodman, R. M. Morgan, N. and Bourlard, H. Factoring Networks by a Statistical Method (Letter) Mottram, J. T. - See Anthony, D. M.
4(6):835-838
Mozer, M., Zemel, R. S., Behrmann, M., and Williams, C. K. I. Learning to Segment Images Using Dynamic Feature Binding (Letter)
4(5):650-665
Murray, A. F. Multilayer Perceptron Learning Optimized for On-Chip Implementation: A Noise-Robust System (Letter)
4(3):366-381
Neal, R. M. Asymmetric Parallel Boltzmann Machines are Belief Networks (Note)
4(6):832-834
Niebur, E. - See Wörgötter, F. Nowlan, S. J. and Hinton, G. E. Simplifying Neural Networks by Soft Weight-Sharing (Article)
4(4):473-493
Orban, G. A. - See Maex, R. Österberg, M. - See Lenz, R. Palm, G. On the Information Storage Capacity of Local Learning Rules (Letter)
4(5):703-711
Peterson, C. - See Gislén, L. Quinn, R. D. - See Beer, R. D. Rapp, M., Yarom, Y., and Segev, I. The Impact of Parallel Fiber Background Activity on the Cable Properties of Cerebellar Purkinje Cells (Letter)
4(4):518-533
Ray, W. H. - See Scott, G. M. Redlich, A. N. - See Atick, J. J. Reggia, J. A., D'Autrechy, C. L., Sutton, G. G., and Weinrich, M. A Competitive Distribution Theory of Neocortical Dynamics (Article)
4(3):287-317
Rinzel, J. - See Wang, X. J. Rinzel, J. and Frankel, P. Activity Patterns of a Slow Synapse Network Predicted by Explicitly Averaging Spike Dynamics (Letter)
4(4):534-545
Rivin, I. - See Lyuu, Y. D. Ruderman, D. L. and Bialek, W. Seeing Beyond the Nyquist Limit (Letter)
4(5):682-690
Sajda, P. - See Finkel, L. H. Schillen, T. - See König, P. Schmidhuber, J. Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks (Letter)
4(1):131-139
Schmidhuber, J. Learning Complex, Extended Sequences Using the Principle of History Compression (Letter)
4(2):234-242
Schmidhuber, J. A Fixed Size Storage O(n³) Time Complexity Learning Algorithm for Fully Recurrent Continually Running Networks (Letter)
4(2):243-248
Schmidhuber, J. Learning Factorial Codes by Predictability Minimization (Letter)
4(6):863-879
Schuster, H. - See Koch, C. Schwartz, B. L. - See Kersten, D. Scott, G. M., Shavlik, J. W., and Ray, W. H. Refining PID Controllers Using Neural Networks (Letter)
4(5):746-757
Segev, I. - See Rapp, M. Shavlik, J. W. - See Scott, G. M. Shinomoto, S. - See Amari, S. Shinomoto, S. - See Kabashima, Y. Smyth, P. - See Goodman, R. M. Soda, G. - See Frasconi, P. Söderberg, B. - See Gislén, L. Softky, W. P. and Koch, C. Cortical Cells Should Fire Regularly, But Do Not (Note)
4(5):643-646
Stubberud, A. R. - See Wabgaonkar, H. M. Sun, G. Z. - See Giles, C. L. Suter, B. and Kabrisky, M. On a Magnitude Preserving Iterative MAXnet Algorithm (Letter)
4(2):224-233
Sutton, G. G. - See Reggia, J. A. Tank, D. W. - See Unnikrishnan, K. P. Tesauro, G. - See Cohn, D. Tsoi, A. C. - See Back, A. D. Unnikrishnan, K. P., Hopfield, J. J., and Tank, D. W. Speaker-Independent Digit Recognition Using a Neural Network with Time-Delayed Connections (Letter)
4(1):108-119
Van den Bout, D. E. - See Bilbro, G. L.
Vapnik, V. - See Bottou, L. Wabgaonkar, H. M. and Stubberud, A. R. How to Incorporate New Pattern Pairs without Having to Teach the Previously Acquired Pattern Pairs (Letter)
4(6):880-887
Wang, X. J. and Rinzel, J. Alternating and Synchronous Rhythms in Reciprocally Inhibitory Model Neurons (Letter)
4(1):84-97
Watrous, R. L. and Kuhn, G. M. Induction of Finite-State Languages Using Second-Order Recurrent Networks (Letter)
4(3):406-414
Weinrich, M. - See Reggia, J. A. Williams, C. K. I. - See Mozer, M. C. Williams, T. Phase Coupling in Simulated Chains of Coupled Oscillators Representing the Lamprey Spinal Cord (Letter)
4(4):546-558
Wörgötter, F., Niebur, E., and Koch, C. Generation of Direction Selectivity by Isotropic Intracortical Connections (Letter)
4(3):332-340
Yarom, Y. - See Rapp, M.
Yuhas, B. - See Jenkins, R. E.
Zemel, R. S. - See Mozer, M. C. Zeppenfeld, T. - See Alspector, J. Zucker, S. - See Miller, D.